The Big Data Analytics Ecosystem at LinkedIn
DESCRIPTION
LinkedIn has several data-driven products that improve the experience of its users, whether they are professionals or enterprises. Supporting this is a large ecosystem of systems and processes that provide data and insights in a timely manner to the products driven by them. This talk provides an overview of the components of this ecosystem: Hadoop, Teradata, Kafka, Databus, Camus, Lumos, etc.
TRANSCRIPT
The Big Data Analytics Ecosystem at LinkedIn
Rajappa Iyer
September 17, 2013
Agenda
LinkedIn by the numbers
An Overview of Data Driven Products / Insights
The Big Data Analytics Ecosystem
– Storage and Compute Platforms
– Data Transport Pipelines
– Data Processing Pipelines
– Operational Tooling - Metadata
Q&A
LinkedIn: The World’s Largest Professional Network
238M+ Members Worldwide
2 New Members Per Second
100M+ Monthly Unique Visitors
3M+ Company Pages
Connecting Talent with Opportunity. At scale…
Data Driven Products and Insights
– Products for Members (Professionals)
– Products for Enterprises (Companies)
– Insights (Analysts and Data Scientists)
All built on Data, Platforms, and Analytics
Products for Members
Products for Enterprises
– Hire: Talent Solutions
– Market: Marketing Solutions
– Sell: Sales Navigator
Examples of Business Insights
Example of Deeper Insight
Job Migration After Financial Collapse
A Simplified Overview of Data Flow
Storage and Compute Platforms
Most data in Avro format
Access via Hive and Pig
Most ETL processes run here
Specialized batch processing
Algorithmic data mining
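A minimal sketch of what reading this Avro data looks like, using the open-source avro Python package; the package choice, file name, and field name are illustrative assumptions, since the talk only says the data is stored in Avro and queried via Hive and Pig.

```python
# Hypothetical example: reading Avro records with the "avro" package.
from avro.datafile import DataFileReader
from avro.io import DatumReader

with open("page_view_events.avro", "rb") as f:   # hypothetical file name
    reader = DataFileReader(f, DatumReader())    # the schema is embedded in the file
    for record in reader:                        # each record is a dict keyed by field name
        print(record["memberId"])                # hypothetical field name
    reader.close()
```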
Storage and Compute Platforms
Integrated Data Warehouse
Standard BI Tools
Interactive Querying (Low Latency)
Workload Management
Transport Pipeline - Kafka
High-volume, low-latency messaging system
Horizontally scalable
Automatic load balancing
Rewindability
Intra-cluster replication
Mainly used for log aggregation and queuing
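To make the messaging model concrete, here is a minimal producer/consumer sketch using the third-party kafka-python client; the client choice, broker address, and topic name are assumptions (LinkedIn's own clients are JVM-based). The auto_offset_reset setting exercises the rewindability bullet: a consumer can replay from the oldest retained offset.

```python
# Hypothetical sketch using the kafka-python client (not LinkedIn's own tooling).
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="broker:9092")
producer.send("page-views", b'{"memberId": 123, "page": "/feed"}')
producer.flush()  # block until the broker acknowledges the message

consumer = KafkaConsumer("page-views",
                         bootstrap_servers="broker:9092",
                         auto_offset_reset="earliest")  # rewind: start from the oldest offset
for message in consumer:
    print(message.offset, message.value)
```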
Transport Pipeline - Databus
Timeline-consistent data change capture
Works with Oracle, MySQL, Espresso…
Transactional semantics
In-order, at-least-once delivery
Low latency
Has scaled to 100s of sources
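Databus itself exposes a Java client library; the self-contained Python sketch below only illustrates the delivery semantics listed above: changes arrive in commit (timeline) order, and the checkpoint advances only after successful processing, so a crash replays the tail rather than losing it (at-least-once). All names are hypothetical.

```python
# Illustrative only; this is not the Databus API.
class CheckpointStore:
    """Remembers the last fully processed change sequence number (SCN)."""
    def __init__(self):
        self.scn = 0
    def load(self):
        return self.scn
    def save(self, scn):
        self.scn = scn

def consume(change_stream, checkpoint, apply_change):
    last = checkpoint.load()
    for scn, row in change_stream:      # events arrive in commit order
        if scn <= last:
            continue                    # duplicate from a replay after a crash; skip
        apply_change(row)
        checkpoint.save(scn)            # advance only after success => at-least-once

# In-memory stream standing in for a database change log:
events = [(1, {"id": 7, "title": "Engineer"}),
          (2, {"id": 7, "title": "Sr. Engineer"})]
consume(events, CheckpointStore(), apply_change=print)
```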
Processing Pipeline: Camus
Camus: Framework for ingesting Kafka streams to HDFS
Camus: Features
Highly scalable due to adaptive input format
– Handled 10x increase in data volume without change
Restartable with checkpointing
Robust auditing support
Plays nicely with Hive and Pig
– Avro format support
– Hive metastore registration
Open source
– https://github.com/linkedin/camus
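Camus is a Java MapReduce job, so the sketch below is only a simplified Python illustration of its restart-with-checkpointing pattern: each run loads the offsets committed by the previous run, pulls only newer messages, lands the data, and commits new offsets last, so a failed run can simply be re-executed.

```python
# Illustrative sketch of checkpointed ingestion; not Camus code.
import json, os

OFFSETS_FILE = "offsets.json"  # Camus keeps its checkpoints in HDFS; a local file here

def load_offsets():
    if os.path.exists(OFFSETS_FILE):
        with open(OFFSETS_FILE) as f:
            return {int(k): v for k, v in json.load(f).items()}
    return {}  # first run: start from the beginning

def run_ingestion(fetch_new_messages, write_to_hdfs):
    offsets = load_offsets()
    for partition, messages in fetch_new_messages(offsets).items():
        write_to_hdfs(partition, messages)                   # land the data first
        offsets[partition] = offsets.get(partition, 0) + len(messages)
    with open(OFFSETS_FILE, "w") as f:
        json.dump(offsets, f)                                # commit checkpoints last
```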
Processing Pipeline: Lumos
Lumos: Framework to ingest database data to HDFS
[Architecture diagram: production Oracle and Espresso databases feed a Databus DB; an Extract step lands staging data (internal) in the ETL Hadoop cluster as incremental/full dumps alongside external data; Virtual Snapshot and Lazy Snapshot materializers, driven by metadata, produce the published virtual snapshot consumed by DWH processes and Pig/Hive loaders.]
Lumos: Features
Supports Espresso, Oracle and MySQL as sources
Full snapshots and incremental dumps
Automatic type translation for most database types
Provides LAST UPDATE semantics for data
Supports low-latency requirements
– Reader API performs just-in-time compaction
Snapshot constructed two ways:
– On-demand compaction for upserts
– Periodic snapshotting that reflects deletes as well
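A minimal sketch of the on-demand upsert compaction described above: incremental dumps are folded into the previous snapshot, keeping the latest version of each row, which is what gives consumers LAST UPDATE semantics. This is illustrative rather than Lumos internals; note that deletes are only reflected by the periodic snapshot path.

```python
# Hypothetical illustration of upsert compaction (not Lumos code).
def compact(snapshot, increments):
    """snapshot: {primary_key: row}; increments: rows carrying a key and
    a last-update timestamp. Returns the compacted snapshot."""
    result = dict(snapshot)
    for row in increments:
        key = row["id"]
        # Keep whichever version was updated last (upsert semantics).
        if key not in result or row["last_updated"] >= result[key]["last_updated"]:
            result[key] = row
    return result

base = {7: {"id": 7, "title": "Engineer", "last_updated": 100}}
delta = [{"id": 7, "title": "Sr. Engineer", "last_updated": 200},
         {"id": 9, "title": "Designer", "last_updated": 150}]
print(compact(base, delta))  # row 7 upserted, row 9 newly inserted
```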
Operational Support - Metadata
ETL pipeline is a complex graph of workflows
– Our comprehensive dashboard production flow is nearly 30 levels deep with complex dependencies
To manage this, we needed to capture:
– Process dependencies
– Data dependencies
– Process execution history
– Data load status
– Data consumption status (watermarks)
Operational Metadata – v1
Capture process dependency graph
– Also capture useful metadata such as process owners
Capture stats for each execution of a workflow
– Time of execution
– Status
– Pointer to error logs
Has proved quite useful for monitoring critical chains
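One plausible shape for this v1 metadata, sketched in Python; every field and name below is hypothetical, chosen only to show a dependency graph plus per-execution stats.

```python
# Hypothetical model of v1 operational metadata.
from dataclasses import dataclass, field

@dataclass
class Process:
    name: str
    owner: str
    upstream: list = field(default_factory=list)    # process dependency graph
    executions: list = field(default_factory=list)  # execution history

def record_execution(process, started_at, status, error_log_url=None):
    process.executions.append(
        {"time": started_at, "status": status, "error_log": error_log_url})

ingest = Process("kafka_ingest", owner="data-infra")
load = Process("member_dim_load", owner="dwh-team", upstream=[ingest])
record_execution(load, "2013-09-17T02:00:00Z", "SUCCEEDED")
```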
Operational Metadata – v2
For each flow, capture input and output data elements
For each execution, capture stats on data element, e.g.
– Number of records / lines read
– Number of records / lines written
– Error counts
– Last processed record
Can be time based or sequence based
This is tracked per flow, since more than one flow can consume the same data element
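A sketch of the v2 additions, again with hypothetical names: per-execution stats are recorded for each (flow, data element) pair, and the consumption watermark is keyed by flow because several flows may read the same element.

```python
# Hypothetical model of v2 operational metadata.
watermarks = {}  # (flow, data_element) -> last processed record (time or sequence based)
run_stats = []   # one row per execution per data element

def record_run(flow, element, records_read, records_written, errors, last_record):
    run_stats.append({"flow": flow, "element": element, "read": records_read,
                      "written": records_written, "errors": errors})
    watermarks[(flow, element)] = last_record

record_run("member_dim_load", "page_view_events",
           records_read=1200000, records_written=1199998,
           errors=2, last_record="2013-09-17T01:59:59Z")
```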
Operational Metadata – The Payoff
Restartable ETL jobs
– Process only the new data since the last successful run
Catch-up mode for ETL jobs
– Single run can consume data from multiple intervals in one batch
– Next run will resume from the correct place
Data freshness and availability dashboard
Coarse form of data lineage
– Impact analysis for the unfortunately all-too-common upstream changes
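A sketch of how such a watermark pays off in restartable, catch-up ETL: a run consumes every interval after its flow's watermark, possibly several missed ones in one batch, and advances the watermark only on success so the next run resumes from the correct place. Hypothetical code, continuing the shape of the previous sketch.

```python
# Hypothetical illustration of watermark-driven restart and catch-up.
watermarks = {}  # (flow, element) -> end of last successfully processed interval

def run_etl(flow, element, available_batches, process):
    low = watermarks.get((flow, element))  # None on the very first run
    pending = [b for b in available_batches if low is None or b["end"] > low]
    for batch in sorted(pending, key=lambda b: b["end"]):  # catch-up: maybe several
        process(batch)
        watermarks[(flow, element)] = batch["end"]  # resume point for the next run

batches = [{"end": "01:00", "rows": []}, {"end": "02:00", "rows": []}]
run_etl("member_dim_load", "page_view_events", batches, process=print)
```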
Putting it all Together
Online Data Stores: Espresso, Voldemort
Data Transport Pipelines: Kafka, Databus
Data Processing Pipelines: Camus, Lumos
Offline Storage / Compute: Hadoop, Teradata
Analytics Applications
Operational Metadata and Tooling
`whoami`
Sr. Manager / DWH Architect @ LinkedIn since 2011
Prior to that:
– Director of Engineering at Digg
– Enterprise Data Architect at eBay
www.linkedin.com/in/rajappaiyer/
Questions?
More at data.linkedin.com
We’re Hiring