©2013 LinkedIn Corporation. All Rights Reserved.
Vijay Aruswamy,
Staff Engineer, Big Data Operations,
LinkedIn Corporation
https://www.linkedin.com/in/vijayaruswamy
Outline
LinkedIn Overview
Why Data is important for LinkedIn
LinkedIn’s Big Data Eco-System
How Automic tools are helping LinkedIn
Our Mission
Connect the world's professionals to make
them more productive and successful.
LinkedIn – World's Largest Professional Network
“What gets measured, gets fixed” - David Henke, Former SVP Operations, LinkedIn
A Few Data-Driven Products
People You May Know (PYMK)
Companies You May Be Interested In
Jobs You May Be Interested In
Groups You May Like
Who Viewed Your Profile
Economic Graph Challenge
Types of Data at LinkedIn
Identity Data + Behavioral Data + Social Data
What Does “Big Data” Mean at LinkedIn?
[Figure: data types plotted by Data Volume against Analytical Challenges & Complexity, both axes running to +∞. Data types shown: Social Media Data, Web/Behavior Data, CRM Data, Member Data]
High Level Data Flow
Camus
Camus is a MapReduce job that loads data from Kafka into HDFS. It can copy data
from Kafka into HDFS incrementally.
http://etl.svbtle.com/setting-up-camus-linkedins-kafka-to-hdfs-pipeline
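The idea of an incremental copy can be sketched in a few lines. This is a toy illustration only, not Camus's actual code: the `IncrementalCopier` class, the in-memory `log` standing in for Kafka partitions, and the `sink` standing in for HDFS files are all hypothetical names introduced here. The assumption is that each run remembers the next offset per partition and copies only records beyond it.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class IncrementalCopier:
    """Toy stand-in for an incremental Kafka-to-HDFS load (not Camus itself)."""

    # Hypothetical state: the next offset to read for each partition.
    offsets: Dict[str, int] = field(default_factory=dict)
    # Hypothetical sink: one list of records per partition, standing in for HDFS.
    sink: Dict[str, List[str]] = field(default_factory=dict)

    def run(self, log: Dict[str, List[str]]) -> int:
        """Copy only records at or beyond the saved offset; return how many were copied."""
        copied = 0
        for partition, records in log.items():
            start = self.offsets.get(partition, 0)
            new_records = records[start:]
            self.sink.setdefault(partition, []).extend(new_records)
            # Save the new offset so the next run resumes from here.
            self.offsets[partition] = start + len(new_records)
            copied += len(new_records)
        return copied


log = {"page_views-0": ["e1", "e2", "e3"]}
copier = IncrementalCopier()
first = copier.run(log)           # first run copies e1, e2, e3
log["page_views-0"].append("e4")  # a new event arrives
second = copier.run(log)          # second run copies only e4
```

Because the offsets are kept between runs, re-running the job with no new events copies nothing.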
Gobblin
A unified data ingestion system for internal and external data sources. Gobblin uses a worker framework in which each record runs through four stages: extraction, conversion, quality checking, and writing.
https://engineering.linkedin.com/data-ingestion/gobblin-big-data-ease
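The four stages can be illustrated with a minimal sketch. It does not use Gobblin's API; the function names (`extract`, `convert`, `passes_quality_check`, `run_pipeline`), the record layout, and the sample input are assumptions made up for this example.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, str]


def extract(source: Iterable[str]) -> Iterator[Record]:
    """Stage 1: pull raw lines from a source and split them into fields."""
    for line in source:
        member_id, _, event = line.partition(",")
        yield {"member_id": member_id, "event": event}


def convert(record: Record) -> Record:
    """Stage 2: normalise values (trim whitespace, lower-case the event)."""
    return {
        "member_id": record["member_id"].strip(),
        "event": record["event"].strip().lower(),
    }


def passes_quality_check(record: Record) -> bool:
    """Stage 3: reject records with a non-numeric member id or an empty event."""
    return record["member_id"].isdigit() and record["event"] != ""


def run_pipeline(source: Iterable[str], sink: List[Record]) -> int:
    """Run every record through all four stages; return the number written."""
    written = 0
    for raw in extract(source):
        record = convert(raw)
        if passes_quality_check(record):
            sink.append(record)  # Stage 4: write
            written += 1
    return written


sink: List[Record] = []
written = run_pipeline(["42, PageView", "bad-id, Click", "7,"], sink)
```

Only the first sample line survives the quality check; the other two are dropped before the write stage.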
High Level Data Flow (Cont.)
Automic
Data-driven scheduling - A process will not execute until its data dependency is satisfied.
Typical time-series roll-up hierarchies (hour :: day :: week :: month :: quarter :: year) are handled by Azkaban.
Processes should execute only when the input data sets are available.
Grouping - Organize components and workflows into a common area for maintenance and enhancements.
Supports external dependencies.
Use of global variables - Keep commonly used passwords in one place.
Throttling - Assign jobs to queues, schedule when jobs run throughout the day, and hold jobs under the same flow.
Load balancing - Assign queues to run on a particular server.
Monitoring - Graphical Explorer.
Azkaban
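Data-driven scheduling can be sketched as a loop that runs a job only once all of its input data sets exist. This is not Automic or Azkaban code; the `Job` class, the `run_ready_jobs` function, and the data set names are hypothetical and chosen only to mirror the roll-up idea from the slide.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Job:
    name: str
    inputs: frozenset  # data sets that must exist before the job may run
    output: str        # data set the job produces


def run_ready_jobs(jobs: List[Job], available: Set[str]) -> List[str]:
    """Repeatedly run jobs whose inputs are all available; return the run order."""
    order: List[str] = []
    pending = list(jobs)
    progressed = True
    while pending and progressed:
        progressed = False
        for job in list(pending):
            if job.inputs <= available:    # data dependency satisfied
                available.add(job.output)  # output becomes input for later jobs
                order.append(job.name)
                pending.remove(job)
                progressed = True
    return order


jobs = [
    Job("weekly_rollup", frozenset({"daily"}), "weekly"),
    Job("daily_rollup", frozenset({"hourly"}), "daily"),
    Job("monthly_rollup", frozenset({"daily"}), "monthly"),
    Job("crm_report", frozenset({"crm_extract"}), "crm_summary"),
]
order = run_ready_jobs(jobs, available={"hourly"})
```

In this example the weekly and monthly roll-ups wait for the daily roll-up, and `crm_report` never runs because its input data set is not available.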
LinkedIn’s Big Data Architecture
Online DBs - Prod DCs
Espresso
Service Metrics
Web Tracking
LinkedIn’s Application Manager
Types of Jobs Scheduled by Automic
External ETL
ODS ETL
Hadoop ETL
Teradata ETL
User Input ETL
Historical Loads
One-time data fixes
Data Volume
How many Kafka topics (tracking + service) do we dump on Hadoop?
– 900+: Tracking: 300 (/data/tracking) + Service: 682 (/data/service)
– Data size/day of the above: ~10 TB
How many online DB tables do we have on Hadoop?
– 300+ tables (Oracle, Espresso, MySQL)
– Data size: ~8 TB
Capacity of DWH on Teradata
– ~186 TB overall with 6-month retention, ~3 TB every day
– ~340k unique queries/day (248k from users and ~90k from ETL)
Capacity of Hadoop
– Biggest cluster: 5 PB with 2500+ nodes
– ETL clusters: 3.1 PB with 360+ nodes
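As a rough cross-check of the figures above, the totals and a per-node average can be worked out directly. The per-node numbers are derived here, not stated on the slide, and assume 1 PB = 1000 TB; since the node counts are given as "2500+" and "360+", the averages are upper bounds.

```python
# Kafka topics loaded into Hadoop
tracking_topics = 300
service_topics = 682
total_topics = tracking_topics + service_topics  # consistent with "900+"

# Unique Teradata queries per day
user_queries = 248_000
etl_queries = 90_000
total_queries = user_queries + etl_queries  # consistent with "~340k"

# Average storage per node, assuming 1 PB = 1000 TB (derived, not from the slide)
TB_PER_PB = 1000
biggest_tb_per_node = 5 * TB_PER_PB / 2500
etl_tb_per_node = 3.1 * TB_PER_PB / 360

print(total_topics, total_queries)
print(round(biggest_tb_per_node, 1), round(etl_tb_per_node, 1))
```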
Q & A