building scalable big data infrastructure using open source software
DESCRIPTION
Building Scalable Big Data Infrastructure Using Open Source Software. Sam William [email protected]. What is StumbleUpon?. Help users find content they did not expect to find. The best way to discover new and interesting things from across the Web. How StumbleUpon works. - PowerPoint PPT PresentationTRANSCRIPT
![Page 2: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/2.jpg)
What is StumbleUpon?
Help users find content they did not expect to find
The best way to discover new and interesting things from across the
Web.
![Page 3: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/3.jpg)
How StumbleUpon works
1. Register 2. Tell us your interests 3. Start Stumbling and rating web pages
We use your interests and behavior to recommend new content for you!
![Page 4: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/4.jpg)
StumbleUpon
![Page 5: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/5.jpg)
The Data Challenge
1 Data collection
2 Real time metrics
3 Batch processing / ETL
4 Data warehousing & ad-hoc analysis
5 Business intelligence & Reporting
![Page 6: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/6.jpg)
Challenges in data collection
• Different services deployed of different tech stacks
• Add minimal latency to the production services
• Application DBs for Analytics / batch processing– From HBase & MySQL
Site Rec /Ad Server Other internal services
Apache / PHP Scala / Finagle Java / Scala / PHP
![Page 7: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/7.jpg)
Data Processing and Warehousing
Raw Data
ETL
Warehouse(HDFS)
TablesTablesMassively
Denormalized Tables
Challenges/Requirements:
•Scale over 100 TBs of data
•End product works with easy querying tools/languages
•Reliable and Scalable – powers analytics and internal reporting.
![Page 8: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/8.jpg)
Real-time analytics and metrics
• Atomic counters
• Tracking product launches
• Monitoring the health of the site
• Latency – live metrics makes sense
• A/B tests
![Page 9: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/9.jpg)
Open Source at SU
![Page 10: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/10.jpg)
Data Collection at SU
![Page 11: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/11.jpg)
Activity Streams and Logs
All messages are Protocol BuffersFast and EfficientMultiple Language Bindings ( Java/ C++ / PHP )CompactVery well documentedExtensible
![Page 12: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/12.jpg)
Apache Kafka
• Distributed pub-sub system • Developed @ LinkedIn• Offers message persistence• Very high throughput
– ~300K messages/sec
• Horizontally scalable • Multiple subscribers for topics .
– Easy to rewind
![Page 13: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/13.jpg)
Kafka
• Near real time process can be taken offline and done at the consumer level
• Semantic partitioning through topics
• Partitions for parallel consumption
• High-level consumer API using ZK
• Simple to deploy- only requires Zookeeper
![Page 14: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/14.jpg)
Kafka At SU
• 4 Broker nodes with RAID10 disks
• 25 topics
• Peak of 3500 msg/s
• 350 bytes avg. message size
• 30 days of data retention
![Page 15: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/15.jpg)
Sutro
• Scala/Finagle
• Generic Kafka message producer
• Deployed on all prod servers
• Local http daemon
• Publishes to Kafka asynchronously
• Snowflake to generate unique Ids
![Page 16: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/16.jpg)
Sutro - Kafka
Site-
Apache/PHP
Sutro
Ad Server-
Scala/Finagle
Sutro
Rec Server-
Scala/Finagle
Sutro
OtherServices
Sutro
Kafka
BrokerBroker BrokerBrokerBrokerBroker
![Page 17: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/17.jpg)
Application Data forAnalytics & Batch Processing
• HBase inter-cluster replication(from production to batch
cluster)
• Near real-time sync on batch cluster
• Readily available in Hive for analysis
HBase
• MySQL replication to Batch DB Servers
• Sqoop incremental data transfer to HDFS
• HDFS flat files mapped to Hive tables & made available for analysis
MySQL
![Page 18: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/18.jpg)
Real-time metrics
1. HBase – Atomic Counters
2. Asynchbase - Coalesced counter inc++
1. OpenTSDB (developed at SU)
– A distributed time-series DB on HBase– Collects over 2 Billion data points a day – Plotting time series graphs– Tagged data points
![Page 19: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/19.jpg)
Real-time counters
Real time metrics from OpenTSDB
![Page 20: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/20.jpg)
Kafka Consumer framework aka Postie
• Distributed system for consuming messages• Scala/Akka -on top of Kafka’s consumer API• Generic consumer - understands protobuf • Predefined sinks HBase / HDFS (Text/Binary) / Redis• Consumers configured with configuration files• Distributed / uses ZK to co-ordinate• Extensible
![Page 21: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/21.jpg)
Postie
![Page 22: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/22.jpg)
Akka
• Building concurrent applications made easy !!
• The distributed nodes are behind Remote Actors
• Load balancing through custom Routers
• The predefined sink and services are accessed through local actors
• Fault-tolerance through actor monitoring
![Page 23: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/23.jpg)
Postie
![Page 24: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/24.jpg)
Batch processing / ETL
Batch processing / ETL
Batch processing / ETL
GOAL: Create simplified data-sets from complex data
Create highly denormalized data sets for faster querying
Power the reporting DB with daily stats
Output structured data for specific analysis
e.g. Registration Flow analysis
![Page 25: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/25.jpg)
Our favourite ETL tools:• Pig
– Optional Schema– Work on byte arrays– Many simple operations can be done without UDFs– Developing UDFs is simple (understand Bags/Tuples)– Concise scripts compared to the M/R equivalents
• Scalding– Functional programming in Scala over Hadoop
– Built on top of Cascading
– Operating over tuples is like operating over collections in Scala
– No UDFs .. Your entire program is in a full-fledged general purpose language
![Page 26: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/26.jpg)
Warehouse - Hive
Uses SQL-like querying language
All Analysts and Data Scientists versed in SQL
Supports Hadoop Streaming (Python/R)
UDFs and Serdes make it highly extensible
Supports partitioning with many table properties configurable at the partition level
![Page 27: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/27.jpg)
Hive at StumbleUpon
HBaseSerde• Reads binary data from HBase• Parses composite binary values into
multiple columns in Hive (mainly on key)
ProtobufSerde
• For creating Hive tables on top of binary protobuf files stored in HDFS
• Serde uses Java reflection to parse and project columns
![Page 28: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/28.jpg)
Data Infrastructure at SU
![Page 29: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/29.jpg)
Data Consumption
![Page 30: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/30.jpg)
End Users of Data
Data Pipeline
Warehouse
Who uses this data? •Data Scientists/Analysts•Offline Rec pipeline•Ads Team
… all this work allows them to focus on querying and analysis which is critical to the business.
![Page 31: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/31.jpg)
Business Analytics / Data Scientists
• Feature-rich set of data to work on• Enriched/Denormalized tables reduce JOINs,
simplifies and speeds queries – shortening path to analysis.
• R: our favorite tool for analysis post Hadoop/Hive.
![Page 32: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/32.jpg)
Recommendation platform
• URL score pipeline– M/R and Hive on Oozie– Filter / Classify into buckets– Score / Loop– Load ES/HBase index
• Keyword generation pipeline– Parse URL data– Generate Tag mappings
![Page 33: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/33.jpg)
URL score pipeline
![Page 34: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/34.jpg)
Advertisement Platform
• Billing Service– RT Kafka consumer– Calculates skips– Bills customers
• Audience Estimation tool– Pre-crunched data into multiple dimensions– A UI tool for Advertisers to estimate target audience
• Sales team tools– Built with PHP leveraging Hive or pre-crunched
ETL data in HBase
![Page 35: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/35.jpg)
More stuff on the pipeline
• Storm from Twitter– Scope for lot more real time analytics.– Very high throughput and extensible– Applications in JVM language
• BI tools– Our current BI tools / dashboards are minimal– Google charts powered by our reporting DB
(HBase primarily).
![Page 36: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/36.jpg)
Open Source FTW!!
• Actively developed and maintained
• Community support
• Built with web-scale in mind
• Distributed systems – Easy with Akka/ZK/Finagle
• Inexpensive
• Only one major catch !!– Hire and retain good engineers !!
![Page 37: Building Scalable Big Data Infrastructure Using Open Source Software](https://reader036.vdocuments.mx/reader036/viewer/2022062323/56815a8b550346895dc80083/html5/thumbnails/37.jpg)
Thank You!
Questions
?