Transcript
Page 1: How Linkedin uses Automic for Big Data Processes

©2013 LinkedIn Corporation. All Rights Reserved.

Page 2: How Linkedin uses Automic for Big Data Processes

Vijay Aruswamy,

Staff Engineer, Big Data Operations,

LinkedIn Corporation

https://www.linkedin.com/in/vijayaruswamy

Page 3: How Linkedin uses Automic for Big Data Processes

Outline

LinkedIn Overview

Why Data is important for LinkedIn

LinkedIn’s Big Data Ecosystem

How Automic tools are helping LinkedIn

Page 4: How Linkedin uses Automic for Big Data Processes

Our Mission

Connect the world's professionals to make them more productive and successful.

Page 5: How Linkedin uses Automic for Big Data Processes

Page 6: How Linkedin uses Automic for Big Data Processes

LinkedIn – World's Largest Professional Network

Page 7: How Linkedin uses Automic for Big Data Processes

Outline

LinkedIn Overview

Why Data is important for LinkedIn

LinkedIn’s Big Data Ecosystem

How Automic tools are helping LinkedIn

Page 8: How Linkedin uses Automic for Big Data Processes

“What gets measured, gets fixed” - David Henke, Former SVP Operations, LinkedIn

Page 9: How Linkedin uses Automic for Big Data Processes

Page 10: How Linkedin uses Automic for Big Data Processes

A Few Data-Driven Products

People You May Know (PYMK)

Companies you may be interested in

Jobs you may be interested in

Groups you may like

Who viewed your profile

Economic Graph Challenge

Page 16: How Linkedin uses Automic for Big Data Processes

Outline

LinkedIn Overview

Why Data is important for LinkedIn

LinkedIn’s Big Data Ecosystem

How Automic tools are helping LinkedIn

Page 17: How Linkedin uses Automic for Big Data Processes

Page 18: How Linkedin uses Automic for Big Data Processes

Types of Data at LinkedIn

Identity Data + Social Data + Behavioral Data

Page 19: How Linkedin uses Automic for Big Data Processes

What does “Big Data” mean at LinkedIn?

Figure: Data Volume versus Analytical Challenges & Complexity (both axes extend to +∞)

Data types shown: Social Media Data, Web/Behavior Data, CRM Data, Member Data

Page 20: How Linkedin uses Automic for Big Data Processes

High Level Data Flow

Page 21: How Linkedin uses Automic for Big Data Processes

Camus

Camus is a MapReduce job that loads data from Kafka into HDFS and can copy that data incrementally.

http://etl.svbtle.com/setting-up-camus-linkedins-kafka-to-hdfs-pipeline
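
The idea of an incremental copy can be illustrated with a minimal Python sketch. This is not Camus code and uses neither the Kafka nor the HDFS API: the topic is modeled as an in-memory list, and the names `incremental_copy`, `committed_offsets`, and `sink` are hypothetical. It assumes the loader remembers, per topic, the offset it reached on its previous run.

```python
from typing import Dict, List


def incremental_copy(topic_log: List[str],
                     committed_offsets: Dict[str, int],
                     topic: str,
                     sink: List[str]) -> int:
    """Copy only the records that arrived after the last committed offset.

    Returns the number of records copied in this run.
    """
    # Resume where the previous run stopped (offset 0 on the first run).
    start = committed_offsets.get(topic, 0)
    new_records = topic_log[start:]

    # "Write" the new records to the destination.
    sink.extend(new_records)

    # Remember how far we got so the next run skips these records.
    committed_offsets[topic] = start + len(new_records)
    return len(new_records)
```

Calling the function twice on a growing log copies each record exactly once, which is the property the slide refers to as incremental copying.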

Gobblin

Unified data ingestion system for internal and external data sources. Gobblin uses a worker framework in which each record runs through four stages: extraction, conversion, quality checking, and writing.

https://engineering.linkedin.com/data-ingestion/gobblin-big-data-ease
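
The four stages named for Gobblin can be sketched as a per-record loop. Gobblin itself is a Java framework, and this Python sketch does not use its API; `run_task`, `convert`, `is_valid`, and `sink` are hypothetical names chosen for illustration, and dropping records that fail the quality check is an assumption rather than documented Gobblin behavior.

```python
from typing import Any, Callable, Iterable, List


def run_task(source: Iterable[Any],
             convert: Callable[[Any], Any],
             is_valid: Callable[[Any], bool],
             sink: List[Any]) -> int:
    """Push each record through extract -> convert -> quality check -> write.

    Returns the number of records written.
    """
    written = 0
    for raw in source:                # 1. extraction: pull the next raw record
        record = convert(raw)         # 2. conversion: reshape it for the sink
        if not is_valid(record):      # 3. quality checking: skip bad records
            continue
        sink.append(record)           # 4. writing: persist the record
        written += 1
    return written
```

For example, converting strings to integers and rejecting values that fail to parse would write only the well-formed records to `sink`.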

Page 22: How Linkedin uses Automic for Big Data Processes

High Level Data Flow (cont.)

Page 23: How Linkedin uses Automic for Big Data Processes

Automic

Data-driven scheduling - A process will not execute before its data dependency is satisfied.

Typical time-series roll-up hierarchies (hour :: day :: week :: month :: quarter :: year) are handled by Azkaban

Processes should execute only when the input data sets are available

Grouping - Organize components and workflows into a common area for maintenance and enhancements

Supports external dependencies

Use of global variables - Store commonly used passwords in one place.

Throttling - Assign jobs to queues, schedule when jobs run throughout the day, and hold jobs under the same flow

Load balancing - Assign queues to run on a particular server

Monitoring - Graphical Explorer

Azkaban
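
The data-driven scheduling rule on this slide (a process runs only once all of its input data sets are available) can be sketched in a few lines of Python. This is an illustration of the rule, not Automic or Azkaban configuration: the function names, the `day/hour` partition naming, and the example date are assumptions made for the sketch.

```python
from typing import Dict, List, Set


def hourly_inputs(day: str) -> List[str]:
    """Hypothetical names of the 24 hourly partitions a daily roll-up needs."""
    return [f"{day}/{hour:02d}" for hour in range(24)]


def runnable_jobs(job_inputs: Dict[str, List[str]],
                  available: Set[str]) -> List[str]:
    """Return the jobs whose input data sets are all available."""
    return sorted(
        job for job, inputs in job_inputs.items()
        if all(dataset in available for dataset in inputs)
    )


# A daily roll-up stays blocked until the last hourly partition lands.
jobs = {"daily_rollup": hourly_inputs("2015-01-01")}
landed = set(hourly_inputs("2015-01-01")[:23])   # hours 00 to 22 only
blocked = runnable_jobs(jobs, landed)            # daily_rollup not included
landed.add("2015-01-01/23")
ready = runnable_jobs(jobs, landed)              # daily_rollup now included
```

The same check extends up the roll-up hierarchy: a weekly job would list seven daily outputs as its inputs, a monthly job its daily or weekly outputs, and so on.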

Page 24: How Linkedin uses Automic for Big Data Processes

LinkedIn’s Big Data Architecture

Online DBs - Prod DCs

Espresso

Service Metrics

Web Tracking

Page 25: How Linkedin uses Automic for Big Data Processes

LinkedIn’s Application Manager

Page 26: How Linkedin uses Automic for Big Data Processes

Types of jobs scheduled by Automic

External ETL

ODS ETL

Hadoop ETL

Teradata ETL

User Input ETL

Historical Loads

One-time data fixes

Page 27: How Linkedin uses Automic for Big Data Processes

Page 28: How Linkedin uses Automic for Big Data Processes

Data Volume

How many Kafka topics (tracking + service) do we dump on Hadoop?

– ~900+: Tracking: 300 (/data/tracking) + Service: 682 (/data/service)

– Data size per day for the above: ~10 TB

How many online DB tables do we have on Hadoop?

– ~300+ tables (Oracle, Espresso, MySQL)

– Data size: ~8 TB

Capacity of DWH on Teradata

– ~186 TB overall with 6-month retention, ~3 TB every day

– ~340K unique queries/day (248K from users and ~90K from ETL)

Capacity of Hadoop

– Biggest cluster: 5 PB with 2500+ nodes

– ETL clusters: 3.1 PB with 360+ nodes
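
A quick back-of-the-envelope check of these figures, as a Python sketch. It assumes decimal units (1 PB = 1000 TB), and because the slide gives node counts as lower bounds ("2500+", "360+"), the per-node results are upper bounds on the average. The function name `avg_tb_per_node` is hypothetical.

```python
def avg_tb_per_node(capacity_pb: float, nodes: int) -> float:
    """Average capacity per node in TB, assuming 1 PB = 1000 TB."""
    return capacity_pb * 1000 / nodes


# Topic counts from the slide: tracking + service.
total_topics = 300 + 682

# Per-node averages for the two Hadoop figures on the slide.
biggest_cluster_tb = avg_tb_per_node(5, 2500)
etl_clusters_tb = avg_tb_per_node(3.1, 360)
```

The topic total comes to 982, consistent with the "~900+" on the slide; the per-node averages work out to roughly 2 TB for the biggest cluster and about 8.6 TB for the ETL clusters.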

Page 29: How Linkedin uses Automic for Big Data Processes

Q & A
