The Big Data Analytics Ecosystem at LinkedIn
DESCRIPTION
LinkedIn has several data-driven products that improve the experience of its users, whether they are professionals or enterprises. Supporting this is a large ecosystem of systems and processes that provide data and insights in a timely manner to the products driven by them. This talk provides an overview of the components of this ecosystem: Hadoop, Teradata, Kafka, Databus, Camus, Lumos, etc.
TRANSCRIPT
The Big Data Analytics Ecosystem at LinkedIn
Rajappa Iyer
September 17, 2013
Agenda
LinkedIn by the numbers
An Overview of Data Driven Products / Insights
The Big Data Analytics Ecosystem
– Storage and Compute Platforms
– Data Transport Pipelines
– Data Processing Pipelines
– Operational Tooling - Metadata
Q&A
LinkedIn: The World’s Largest Professional Network
238M+ Members Worldwide
2 New Members Per Second
100M+ Monthly Unique Visitors
3M+ Company Pages
Connecting Talent with Opportunity. At scale…
Data Driven Products and Insights
– Products for Members (Professionals)
– Products for Enterprises (Companies)
– Insights (Analysts and Data Scientists)
All built on Data, Platforms, and Analytics
Products for Members
Products for Enterprises
– Hire: Talent Solutions
– Market: Marketing Solutions
– Sell: Sales Navigator
Examples of Business Insights
Example of Deeper Insight
Job Migration After Financial Collapse
A Simplified Overview of Data Flow
Storage and Compute Platforms
Most data in Avro format
Access via Hive and Pig
Most ETL processes run here
Specialized batch processing
Algorithmic data mining
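A minimal sketch of what reading this Avro data looks like, using the open-source avro Python package; the package choice, file name, and field name are illustrative assumptions, since the talk only says the data is stored in Avro and queried via Hive and Pig.

```python
# Hypothetical example: reading Avro records with the "avro" package.
from avro.datafile import DataFileReader
from avro.io import DatumReader

with open("page_view_events.avro", "rb") as f:   # hypothetical file name
    reader = DataFileReader(f, DatumReader())    # the schema is embedded in the file
    for record in reader:                        # each record is a dict keyed by field name
        print(record["memberId"])                # hypothetical field name
    reader.close()
```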
Storage and Compute Platforms
Integrated Data Warehouse
Standard BI Tools
Interactive Querying (Low Latency)
Workload Management
Transport Pipeline - Kafka
High-volume, low-latency messaging system
Horizontally scalable
Automatic load balancing
Rewindability
Intra-cluster replication
Mainly used for log aggregation and queuing
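To make the messaging model concrete, here is a minimal producer/consumer sketch using the third-party kafka-python client; the client choice, broker address, and topic name are assumptions (LinkedIn's own clients are JVM-based). The auto_offset_reset setting exercises the rewindability bullet: a consumer can replay from the oldest retained offset.

```python
# Hypothetical sketch using the kafka-python client (not LinkedIn's own tooling).
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="broker:9092")
producer.send("page-views", b'{"memberId": 123, "page": "/feed"}')
producer.flush()  # block until the broker acknowledges the message

consumer = KafkaConsumer("page-views",
                         bootstrap_servers="broker:9092",
                         auto_offset_reset="earliest")  # rewind: start from the oldest offset
for message in consumer:
    print(message.offset, message.value)
```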
Transport Pipeline - Databus
Timeline-consistent data change capture
Works with Oracle, MySQL, Espresso…
Transactional semantics
In-order, at-least-once delivery
Low latency
Has scaled to 100s of sources
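Databus itself exposes a Java client library; the self-contained Python sketch below only illustrates the delivery semantics listed above: changes arrive in commit (timeline) order, and the checkpoint advances only after successful processing, so a crash replays the tail rather than losing it (at-least-once). All names are hypothetical.

```python
# Illustrative only; this is not the Databus API.
class CheckpointStore:
    """Remembers the last fully processed change sequence number (SCN)."""
    def __init__(self):
        self.scn = 0
    def load(self):
        return self.scn
    def save(self, scn):
        self.scn = scn

def consume(change_stream, checkpoint, apply_change):
    last = checkpoint.load()
    for scn, row in change_stream:      # events arrive in commit order
        if scn <= last:
            continue                    # duplicate from a replay after a crash; skip
        apply_change(row)
        checkpoint.save(scn)            # advance only after success => at-least-once

# In-memory stream standing in for a database change log:
events = [(1, {"id": 7, "title": "Engineer"}),
          (2, {"id": 7, "title": "Sr. Engineer"})]
consume(events, CheckpointStore(), apply_change=print)
```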
Processing Pipeline: Camus
Camus: Framework for ingesting Kafka streams to HDFS
Camus: Features
Highly scalable due to adaptive input format
– Handled 10x increase in data volume without change
Restartable with checkpointing
Robust auditing support
Plays nicely with Hive and Pig
– Avro format support
– Hive metastore registration
Open source
– https://github.com/linkedin/camus
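Camus is a Java MapReduce job, so the sketch below is only a simplified Python illustration of its restart-with-checkpointing pattern: each run loads the offsets committed by the previous run, pulls only newer messages, lands the data, and commits new offsets last, so a failed run can simply be re-executed.

```python
# Illustrative sketch of checkpointed ingestion; not Camus code.
import json, os

OFFSETS_FILE = "offsets.json"  # Camus keeps its checkpoints in HDFS; a local file here

def load_offsets():
    if os.path.exists(OFFSETS_FILE):
        with open(OFFSETS_FILE) as f:
            return {int(k): v for k, v in json.load(f).items()}
    return {}  # first run: start from the beginning

def run_ingestion(fetch_new_messages, write_to_hdfs):
    offsets = load_offsets()
    for partition, messages in fetch_new_messages(offsets).items():
        write_to_hdfs(partition, messages)                   # land the data first
        offsets[partition] = offsets.get(partition, 0) + len(messages)
    with open(OFFSETS_FILE, "w") as f:
        json.dump(offsets, f)                                # commit checkpoints last
```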
Processing Pipeline: Lumos
Lumos: Framework to ingest database data to HDFS
[Architecture diagram: production Oracle and Espresso databases feed a Databus DB; an Extract step lands staging data (internal) in the ETL Hadoop cluster as incremental/full dumps alongside external data; Virtual Snapshot and Lazy Snapshot materializers, driven by metadata, produce the published virtual snapshot consumed by DWH processes and Pig/Hive loaders.]
Lumos: Features
Supports Espresso, Oracle and MySQL as sources
Full snapshots and incremental dumps
Automatic type translation for most database types
Provides LAST UPDATE semantics for data
Supports low-latency requirements
– Reader API performs just-in-time compaction
Snapshot constructed two ways:
– On-demand compaction for upserts
– Periodic snapshotting that reflects deletes as well
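A minimal sketch of the on-demand upsert compaction described above: incremental dumps are folded into the previous snapshot, keeping the latest version of each row, which is what gives consumers LAST UPDATE semantics. This is illustrative rather than Lumos internals; note that deletes are only reflected by the periodic snapshot path.

```python
# Hypothetical illustration of upsert compaction (not Lumos code).
def compact(snapshot, increments):
    """snapshot: {primary_key: row}; increments: rows carrying a key and
    a last-update timestamp. Returns the compacted snapshot."""
    result = dict(snapshot)
    for row in increments:
        key = row["id"]
        # Keep whichever version was updated last (upsert semantics).
        if key not in result or row["last_updated"] >= result[key]["last_updated"]:
            result[key] = row
    return result

base = {7: {"id": 7, "title": "Engineer", "last_updated": 100}}
delta = [{"id": 7, "title": "Sr. Engineer", "last_updated": 200},
         {"id": 9, "title": "Designer", "last_updated": 150}]
print(compact(base, delta))  # row 7 upserted, row 9 newly inserted
```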
Operational Support - Metadata
ETL pipeline is a complex graph of workflows
– Our comprehensive dashboard production flow is nearly 30 levels deep with complex dependencies
To manage this, we needed to capture:
– Process dependencies
– Data dependencies
– Process execution history
– Data load status
– Data consumption status (watermarks)
Operational Metadata – v1
Capture process dependency graph
– Also capture useful metadata such as process owners
Capture stats for each execution of a workflow
– Time of execution
– Status
– Pointer to error logs
Has proved quite useful for monitoring critical chains
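One plausible shape for this v1 metadata, sketched in Python; every field and name below is hypothetical, chosen only to show a dependency graph plus per-execution stats.

```python
# Hypothetical model of v1 operational metadata.
from dataclasses import dataclass, field

@dataclass
class Process:
    name: str
    owner: str
    upstream: list = field(default_factory=list)    # process dependency graph
    executions: list = field(default_factory=list)  # execution history

def record_execution(process, started_at, status, error_log_url=None):
    process.executions.append(
        {"time": started_at, "status": status, "error_log": error_log_url})

ingest = Process("kafka_ingest", owner="data-infra")
load = Process("member_dim_load", owner="dwh-team", upstream=[ingest])
record_execution(load, "2013-09-17T02:00:00Z", "SUCCEEDED")
```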
Operational Metadata – v2
For each flow, capture input and output data elements
For each execution, capture stats on data element, e.g.
– Number of records / lines read
– Number of records / lines written
– Error counts
– Last processed record
Can be time based or sequence based
This is tracked per flow, since more than one flow can consume the same data element
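A sketch of the v2 additions, again with hypothetical names: per-execution stats are recorded for each (flow, data element) pair, and the consumption watermark is keyed by flow because several flows may read the same element.

```python
# Hypothetical model of v2 operational metadata.
watermarks = {}  # (flow, data_element) -> last processed record (time or sequence based)
run_stats = []   # one row per execution per data element

def record_run(flow, element, records_read, records_written, errors, last_record):
    run_stats.append({"flow": flow, "element": element, "read": records_read,
                      "written": records_written, "errors": errors})
    watermarks[(flow, element)] = last_record

record_run("member_dim_load", "page_view_events",
           records_read=1200000, records_written=1199998,
           errors=2, last_record="2013-09-17T01:59:59Z")
```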
Operational Metadata – The Payoff
Restartable ETL jobs
– Process only the new data since the last successful run
Catch-up mode for ETL jobs
– Single run can consume data from multiple intervals in one batch
– Next run will resume from the correct place
Data freshness and availability dashboard
Coarse form of data lineage
– Impact analysis for the unfortunately all-too-common upstream changes
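A sketch of how such a watermark pays off in restartable, catch-up ETL: a run consumes every interval after its flow's watermark, possibly several missed ones in one batch, and advances the watermark only on success so the next run resumes from the correct place. Hypothetical code, continuing the shape of the previous sketch.

```python
# Hypothetical illustration of watermark-driven restart and catch-up.
watermarks = {}  # (flow, element) -> end of last successfully processed interval

def run_etl(flow, element, available_batches, process):
    low = watermarks.get((flow, element))  # None on the very first run
    pending = [b for b in available_batches if low is None or b["end"] > low]
    for batch in sorted(pending, key=lambda b: b["end"]):  # catch-up: maybe several
        process(batch)
        watermarks[(flow, element)] = batch["end"]  # resume point for the next run

batches = [{"end": "01:00", "rows": []}, {"end": "02:00", "rows": []}]
run_etl("member_dim_load", "page_view_events", batches, process=print)
```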
Putting it all Together
Online Data Stores: Espresso, Voldemort
Data Transport Pipelines: Kafka, Databus
Data Processing Pipelines: Camus, Lumos
Offline Storage / Compute: Hadoop, Teradata
Analytics Applications
Operational Metadata and Tooling
`whoami`
Sr. Manager / DWH Architect @ LinkedIn since 2011
Prior to that:
– Director of Engineering at Digg
– Enterprise Data Architect at eBay
www.linkedin.com/in/rajappaiyer/
Questions?
More at data.linkedin.com
We’re Hiring