the "big data" ecosystem at linkedin

Post on 11-May-2015


DESCRIPTION

[This work was presented at SIGMOD'13.] The use of large-scale data mining and machine learning has proliferated through the adoption of technologies such as Hadoop, with its simple programming semantics and rich and active ecosystem. This paper presents LinkedIn's Hadoop-based analytics stack, which allows data scientists and machine learning researchers to extract insights and build product features from massive amounts of data. In particular, we present our solutions to the "last mile" issues in providing a rich developer ecosystem. This includes easy ingress from and egress to online systems, and managing workflows as production processes. A key characteristic of our solution is that these distributed system concerns are completely abstracted away from researchers. For example, deploying data back into the online system is simply a 1-line Pig command that a data scientist can add to the end of their script. We also present case studies on how this ecosystem is used to solve problems ranging from recommendations to news feed updates to email digesting to descriptive analytical dashboards for our members.

TRANSCRIPT

The "Big Data" Ecosystem at LinkedInSIGMOD 2013Roshan Sumbaly, Jay Kreps, & Sam ShahJune 2013

©2012 LinkedIn Corporation. All Rights Reserved.


LinkedIn: the professional profile of record

225M members, 225M member profiles


Applications


Application examples

People You May Know (2 people)
Year In Review Email (1 person, 1 month)
Skills and Endorsements (2 people)
Network Updates Digest (1 person, 3 months)
Who’s Viewed My Profile (2 people)
Collaborative Filtering (1 person)
Related Searches (1 person, 3 months)
and more…


Skill sets


Rich Hadoop-based ecosystem


“Last mile” problems

Ingress – Moving data from online to offline systems

Workflow management – Managing offline processes

Egress – Moving results from offline to online systems: Key/Value, Streams, OLAP


Application examples

People You May Know (2 people)
Year In Review Email (1 person, 1 month)
Skills and Endorsements (2 people)
Network Updates Digest (1 person, 3 months)
Who’s Viewed My Profile (2 people)
Collaborative Filtering (1 person)
Related Searches (1 person, 3 months)
and more…


People You May Know


People You May Know – Workflow

Perform triangle closing for all members

Triangle closing

Rank by discounting previously shown recommendations

Push recommendations to online service

Connection stream

Impression stream

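The two core steps of this workflow (triangle closing over the connection stream, then discounting candidates already seen in the impression stream) can be sketched in plain Python. This is a toy in-memory model, not LinkedIn's actual Hadoop implementation, and the 0.4 impression-discount factor is an arbitrary illustrative choice:

```python
from collections import defaultdict

def triangle_closing(connections, impressions=None):
    """Recommend friends-of-friends, scored by shared-connection count,
    discounting candidates the member has already been shown.

    connections: iterable of (member_a, member_b) undirected edges.
    impressions: set of (member, candidate) pairs already shown.
    """
    impressions = impressions or set()

    # Build the adjacency sets (the "connection stream" materialized).
    neighbors = defaultdict(set)
    for a, b in connections:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Triangle closing: every friend-of-friend who is not already a friend
    # gets one point per shared connection.
    scores = defaultdict(lambda: defaultdict(int))
    for member, friends in neighbors.items():
        for friend in friends:
            for fof in neighbors[friend]:
                if fof != member and fof not in friends:
                    scores[member][fof] += 1

    # Rank, discounting previously shown recommendations (factor is arbitrary).
    recs = {}
    for member, candidates in scores.items():
        ranked = sorted(
            candidates.items(),
            key=lambda kv: kv[1] * (0.4 if (member, kv[0]) in impressions else 1.0),
            reverse=True)
        recs[member] = [m for m, _ in ranked]
    return recs
```

In production these steps run as separate Hadoop jobs over the full member graph; the point of the sketch is just the shape of the computation, not its scale.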

“Last mile” problems

Ingress – Moving data from online to offline systems

Workflow management – Managing offline processes

Egress – Moving results from offline to online systems: Key/Value, Streams, OLAP


Ingress – O(n²) data integration complexity

Point to point
Fragile, delayed, and potentially lossy
Non-standardized


Ingress - O(n) data integration

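The arithmetic behind these two slides: with n systems, point-to-point wiring needs a pipeline per ordered pair of systems, while routing everything through a central log needs only one producer hookup and one consumer hookup per system. A small sketch (the function name is mine):

```python
def pipeline_counts(n_systems):
    """Compare integration costs for n systems.

    Point-to-point: every system may ship data to every other system,
    i.e. one pipeline per ordered pair -> n * (n - 1), which is O(n^2).
    Central log: each system writes to the log once and reads from it
    once -> 2 * n, which is O(n).
    """
    point_to_point = n_systems * (n_systems - 1)
    via_central_log = 2 * n_systems
    return point_to_point, via_central_log
```

At ten systems that is already 90 pipelines versus 20, and the gap widens quadratically as systems are added.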

Ingress – Kafka

Distributed and elastic – multi-broker system

Categorized topics – “PeopleYouMayKnowTopic”, “ConnectionUpdateTopic”


Ingress

Standardized schemas
– Avro
– Central repository
– Programmatic compatibility

Audited ETL to Hadoop

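The "programmatic compatibility" point can be illustrated with a much-simplified schema-evolution check. Real Avro schema resolution is considerably richer (type promotion, unions, aliases); this toy treats a schema as a flat dict of field specs and only checks the default-value rule, and the function name is mine:

```python
def compatible(old_fields, new_fields):
    """Simplified full-compatibility check in the spirit of Avro evolution.

    A field spec is a dict like {"type": "string", "default": ""}.
    Dropping an old field is only safe if it had a default (old readers
    of new data can fill it in); adding a new field is only safe if it
    has a default (new readers of old data can fill it in).
    """
    for name, spec in old_fields.items():
        if name not in new_fields and "default" not in spec:
            return False
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            return False
    return True
```

Running a check like this against a central schema repository at commit time is what lets producers evolve event formats without silently breaking the audited ETL into Hadoop.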

“Last mile” problems

Ingress – Moving data from online to offline systems

Workflow management – Managing offline processes

Egress – Moving results from offline to online systems: Key/Value, Streams, OLAP


People You May Know – Workflow

Perform triangle closing for all members

Rank by discounting previously shown recommendations

Push recommendations to online service

Connection stream

Impression stream


People You May Know – Workflow (in reality)


Workflow Management - Azkaban

Dependency management– Historical logs

Diverse job types– Pig, Hive, Java

Scheduling
Monitoring
Visualization
Configuration
Retry/restart on failure
Resource locking

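A minimal sketch of what a dependency-managing scheduler like Azkaban does: order jobs topologically by their declared dependencies, run each one, and retry on failure. This is an illustrative toy using Python's standard `graphlib` (Python 3.9+), not Azkaban's actual API; the function and parameter names are mine:

```python
from graphlib import TopologicalSorter

def run_workflow(deps, jobs, max_retries=2):
    """Run jobs in dependency order with retry-on-failure.

    deps: job name -> set of prerequisite job names (must run first).
    jobs: job name -> zero-argument callable that raises on failure.
    Returns a log of (job, attempt_index, status) tuples.
    """
    # graphlib expects node -> predecessors, which matches deps directly.
    order = list(TopologicalSorter(deps).static_order())
    log = []
    for name in order:
        for attempt in range(max_retries + 1):
            try:
                jobs[name]()
                log.append((name, attempt, "ok"))
                break
            except Exception:
                if attempt == max_retries:
                    log.append((name, attempt, "failed"))
                    raise  # a failed prerequisite stops the workflow
    return log
```

The real system adds the rest of the slide's feature list (scheduling, monitoring, visualization, resource locking) around this same core: a DAG of heterogeneous jobs executed in dependency order.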

People You May Know – Workflow

Perform triangle closing for all members

Rank by discounting previously shown recommendations

Push recommendations to online service

Connection stream

Impression stream

Member id 1213 => [ Recommended member id 1734, Recommended member id 1523, … Recommended member id 6332 ]


“Last mile” problems

Ingress – Moving data from online to offline systems

Workflow management – Managing offline processes

Egress – Moving results from offline to online systems: Key/Value, Streams, OLAP


Egress – Key/Value

Voldemort– Based on Amazon’s Dynamo

Distributed and elastic
Horizontally scalable
Bulk load pipeline from Hadoop
Simple to use:

store results into 'url' using KeyValue('member_id')

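The bulk-load pipeline can be modeled as a hash-partitioned read-only store whose partitions are built offline in full and then swapped in as one unit; that atomic swap is what makes the Hadoop-to-Voldemort handoff safe. A toy in-memory sketch (class and method names are mine, and real Voldemort adds replication, versioned store directories, and rollback):

```python
import hashlib

class ReadOnlyStore:
    """Toy model of a Voldemort-style read-only key/value store."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.partitions = [dict() for _ in range(num_nodes)]

    def _node_for(self, key):
        # Deterministic hash partitioning: the same key always maps
        # to the same node, so the offline builder and the online
        # reader agree on data placement.
        h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
        return h % self.num_nodes

    def bulk_load(self, records):
        """Build all partitions from a full batch (the Hadoop job's role),
        then swap them in atomically: readers never see a half-built store."""
        new = [dict() for _ in range(self.num_nodes)]
        for key, value in records:
            new[self._node_for(key)][key] = value
        self.partitions = new

    def get(self, key):
        return self.partitions[self._node_for(key)].get(key)
```

Because the store is read-only between swaps, the Hadoop side can precompute index files per node and the online side only ever does lookups, which is why the whole egress step reduces to the one-line Pig statement on the slide.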

People You May Know - Summary


Application examples

People You May Know (2 people)
Year In Review Email (1 person, 1 month)
Skills and Endorsements (2 people)
Network Updates Digest (1 person, 3 months)
Who’s Viewed My Profile (2 people)
Collaborative Filtering (1 person)
Related Searches (1 person, 3 months)
and more…


Year In Review Email


memberPosition = LOAD '$latest_positions' USING BinaryJSON;

memberWithPositionsChangedLastYear = FOREACH (
        FILTER memberPosition BY
            (start_date >= $start_date_low) AND (start_date <= $start_date_high))
    GENERATE member_id, start_date, end_date;

allConnections = LOAD '$latest_bidirectional_connections' USING BinaryJSON;

allConnectionsWithChange_nondistinct = FOREACH (
        JOIN memberWithPositionsChangedLastYear BY member_id, allConnections BY dest)
    GENERATE allConnections::source AS source, allConnections::dest AS dest;
allConnectionsWithChange = DISTINCT allConnectionsWithChange_nondistinct;

memberinfowpics = LOAD '$latest_memberinfowpics' USING BinaryJSON;
pictures = FOREACH (
        FILTER memberinfowpics BY
            (cropped_picture_id IS NOT NULL) AND
            ((member_picture_privacy == 'N') OR (member_picture_privacy == 'E')))
    GENERATE member_id, cropped_picture_id,
             first_name AS dest_first_name, last_name AS dest_last_name;

resultPic = JOIN allConnectionsWithChange BY dest, pictures BY member_id;
connectionsWithChangeWithPic = FOREACH resultPic GENERATE
    allConnectionsWithChange::source AS source_id,
    allConnectionsWithChange::dest AS member_id,
    pictures::cropped_picture_id AS pic_id,
    pictures::dest_first_name AS dest_first_name,
    pictures::dest_last_name AS dest_last_name;

joinResult = JOIN connectionsWithChangeWithPic BY source_id, memberinfowpics BY member_id;
withName = FOREACH joinResult GENERATE
    connectionsWithChangeWithPic::source_id AS source_id,
    connectionsWithChangeWithPic::member_id AS member_id,
    connectionsWithChangeWithPic::dest_first_name AS first_name,
    connectionsWithChangeWithPic::dest_last_name AS last_name,
    connectionsWithChangeWithPic::pic_id AS pic_id,
    memberinfowpics::first_name AS firstName, memberinfowpics::last_name AS lastName,
    memberinfowpics::gmt_offset AS gmt_offset, memberinfowpics::email_locale AS email_locale,
    memberinfowpics::email_address AS email_address;

resultGroup = GROUP withName BY
    (source_id, firstName, lastName, email_address, email_locale, gmt_offset);

-- Get the count of results per recipient
resultGroupCount = FOREACH resultGroup GENERATE
    group, withName AS toomany, COUNT_STAR(withName) AS num_results;
resultGroupPre = FILTER resultGroupCount BY num_results > 2;
resultGroup = FOREACH resultGroupPre {
    withName = LIMIT toomany 64;
    GENERATE group, withName, num_results;
}

x_in_review_pre_out = FOREACH resultGroup GENERATE
    FLATTEN(group) AS (source_id, firstName, lastName, email_address, email_locale, gmt_offset),
    withName.(member_id, pic_id, first_name, last_name) AS jobChanger,
    '2013' AS changeYear:chararray,
    num_results AS num_results;

x_in_review = FOREACH x_in_review_pre_out GENERATE
    source_id AS recipientID, gmt_offset AS gmtOffset,
    firstName AS first_name, lastName AS last_name,
    email_address, email_locale,
    TOTUPLE(changeYear, source_id, firstName, lastName, num_results, jobChanger) AS body;

rmf $xir;
STORE x_in_review INTO '$url' USING Kafka();


Year In Review Email – Workflow

Find users that have changed jobs

Join with connections and metadata

(pictures)

Group by connections of these users

Push content to email service


“Last mile” problems

Ingress – Moving data from online to offline systems

Workflow management – Managing offline processes

Egress – Moving results from offline to online systems: Key/Value, Streams, OLAP


Egress - Streams

Service acts as consumer – “EmailContentTopic”

store emails into 'url' using Stream("topic=x")

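The pattern above can be sketched as a toy topic bus: the Hadoop job publishes result rows to a named topic, and the online email service drains that topic as a consumer. Real Kafka adds persistence, partitioning, and consumer offsets, none of which this in-memory toy models; the class name is mine:

```python
from collections import defaultdict, deque

class TopicBus:
    """Toy topic-based stream: offline jobs publish to a named topic
    and online services consume from it independently."""

    def __init__(self):
        self.topics = defaultdict(deque)

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, topic):
        # Yield messages in publication order until the topic is drained.
        queue = self.topics[topic]
        while queue:
            yield queue.popleft()

# The egress job publishes email payloads; the email service consumes them.
bus = TopicBus()
bus.publish("EmailContentTopic", {"recipient": 1213, "body": "Year in review"})
emails = list(bus.consume("EmailContentTopic"))
```

The decoupling is the point: the Hadoop job knows only the topic name, so the same one-line STORE statement works regardless of which services subscribe downstream.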

Conclusion

Hadoop: simple programmatic model, rich developer ecosystem

Primitives for
– Ingress: structured, complete data available; automatically handles data evolution
– Workflow management: run and operate production processes
– Egress: 1-line command for exporting data; horizontally scalable, little need for capacity planning

Empowers data scientists to focus on new product ideas, not infrastructure

Future work: models of computation

• Alternating Direction Method of Multipliers (ADMM)
• Distributed Conjugate Gradient Descent (DCGD)
• Distributed L-BFGS
• Bayesian Distributed Learning (BDL)

Graphs

Distributed learning

Near-line processing


data.linkedin.com
