near real time processing of social media data with hbase
DESCRIPTION
Monitoring social media and news media in general is kind of building a search engine - a specific search engine called news agent. Where classical search engines are about to have all kind of content, newsagent focus on news and social media only. This is where content freshness matters - it has to be near real time. In order to process several million news with hundreds of Megabytes per day one has to choose a system architecture that is reliable on one hand and massive scalable at the other hand. Those requirements leads to a distributed architecture, a NoSQL approach. This talk will give you a brief overview about the Media Monitoring use case, the system architecture based on Hadoop and HBase, challenges and lessons learned.TRANSCRIPT
CC 2.0 by William Brawley | http://flic.kr/p/7PdUP3
2
Agenda
• Why Hadoop and HBase? • Social Media Monitoring • Prospective Search and Coprocessors
• Challenges & Lessons Learned • Resources to get started
August 31,
2012
3
About Sentric
• Spin-off of MeMo News AG, the leading provider for Social Media Monitoring & Analytics in Switzerland
• Big Data expert, focused on Hadoop, HBase and Solr
• Objective: Transforming data into insights
August 31,
2012
CC 2.0 by Editor B| h"p://flic.kr/p/bcU5aD
5
Social Media Monitoring Process
Why Hadoop and HBase?
August 31,
2012
Information Gathering
Information Processing
Analysis & Interpretation
Insight Presentation
6
Requirements
Why Hadoop and HBase?
August 31,
2012
SMM
Cost effective
High scalable
RT Alerting
Analytical capabilities
Reliable
Freshness
7
Hadoop
• HDFS + MapReduce • Based on Google Papers • Distributed Storage and Computation
Framework • Affordable Hardware, Free Software
• Significant Adoption
Why Hadoop and HBase?
August 31,
2012
8
HBase
• Non-Relational, Distributed Database • Column-Oriented • Multi-Dimensional • High Availability • High Performance • Build on top of HDFS as storage layer
Why Hadoop and HBase?
August 31,
2012
9
Technology Stack
Why Hadoop and HBase?
August 31,
2012
HBase /HDFS Storage
Hadoop Mahout Analytics
Solr Search
HBase RowLog Event mechanism (MQ)
Prospective search Real-time alerting
CC 2.0 by nolifebeforecoffee | http://flic.kr/p/c1UTf
11
Overview
Social Media Monitoring
August 31,
2012
Search Agents
Downloaded Articles
Output
match?
RT Alerts Reports Web-UI
Icons by http://dryicons.com
12
Solution Architecture
Social Media Monitoring
August 31,
2012
REST
n News Agents
MySQL Solr
Web-UI
RT Alerts
Coprocessor
HBase
Icons by http://dryicons.com
13
Prospective Search with Coprocessors
Social Media Monitoring
August 31,
2012
Processing
HRegionServer
HRegion
Put operations
Prospective Search
RT Alerts
Icons by http://dryicons.com
14
Key Figures
• Monthly growth • Index: 200GB • 50 Mio. docs/month
• HBase: 600 GB • Raw data, meta data and extracted
data
• A few 1000 map-reduce jobs/month
Social Media Monitoring
August 31,
2012
CC 2.0 by saebaryo | h"p://flic.kr/p/5T4t5L
16 1 Benchmarks - workloads 2 Supervision 3 Keys and shards – Schema design /LG 4 Timestamps, the 4th dimension 5 Short ColumnFamily names-> 6 File handles. OS 7 JVM Tuning, GC !!! 8 Scaling region servers, data locality! 9 Automatic vs manual splits, compaction 10 Do not use HBase as rock solid in prod 11 Forget feuerwehr aktionen, it takes some time 12 Use Hbase for a apropriate use case 13 Tune and tweak – it‘s not a project – it‘s a process 14 You need devops in production 15 Huge know-how curve, you need to know the hole ecosystem 16 Use a distribution, ist packed, tested and supports migration, enterprise grade 17 Virtualisierung, Hardware 18 Dont struggle to much, there is a good community 19 Share your knowledge 20 It‘s early state, many tools around, a few still missing
Challenges & Lessons Learned
August 31, 2012
17
Challenges
• Everyone is still learning • Some issues only appear at scale • At scale, nothing works as advertised
• Production cluster configuration • Hardware issues • Tuning cluster configuration to our work
loads
• HBase stability • Monitoring health of HBase
Challenges & Lessons Learned
August 31,
2012
18
Lessons - General
• Do not rely on HBase as frontend storage layer. It’s not going to be rock solid
• Don’t struggle to much, there is a good community
• Share your knowledge • It‘s early stage, many tools around, a
few still missing
Challenges & Lessons Learned
August 31,
2012
19
Lessons - Planning
• Use HBase for an appropriate use case • Use a distribution, its packed, tested and
supports migration, enterprise grade • Benchmarks – know your workloads &
query patterns • YCSB
• Schema & Key Design • What’s queried together should be stored
together • Scaling region servers, data locality! • Virtualization vs. Real Hardware
Challenges & Lessons Learned
August 31,
2012
20
Lessons - Performance Tuning
• Number of CF < 10 • Compaction + Flushing I/O intensive
• Short ColumnFamily names • HFile index size occupying aloc RAM (storefileindexSize)
• OS file handles • ulimit –n 32768
• JVM Tuning, GC !!! • HMaster 1024 MB • RegionServer 8192 MB • -XX:+UseConcMarkSweepGC • -XX:+CMSIncrementalMode
• Automatic vs. manual splits • Be careful with expensive operations in coprocessors • Play with all the configurations and benchmark for tuning
Challenges & Lessons Learned
August 31,
2012
21
Lessons - Operation
• Monitoring/Operational tooling is most important
• Forget “emergency actions”, it takes some time
• Tune and tweak – it‘s not a project – it‘s a process
• You need DevOps in production • Huge know-how curve, you need to
know the whole ecosystem • Hadoop, HDFS, MapRed
Challenges & Lessons Learned
August 31,
2012
22
Resources to get started
• http://hbase.apache.org/book.html • http://www.sentric.ch/blog/best-
practice-why-monitoring-hbase-is-important
• http://www.sentric.ch/blog/hadoop-overview-of-top-3-distributions
• http://www.sentric.ch/blog/hadoop-best-practice-cluster-checklist
• http://outerthought.org/blog/465-ot.html
August 31,
2012
23
Thank you!
Questions? Christian Gügi, [email protected]
Jean-Pierre König, [email protected]
NoSQL Roadshow Basel
August 31,
2012
24
Cluster
Masters
August 31, 2012
25
Cluster
Worker
August 31, 2012