Flume Office Hours: Community planning
Jonathan Hsieh
Cloudera HQ, 2/28/2011
Flume Office Hours, 2/28/2011 3
Outline
• State of the world
• What’s new?
• Stories (Chime in!)
• What needs work?
• Prioritizing what is next
• Q+A
STATE OF THE WORLD
Growing user and developer community
• GitHub stats:
– Currently 295 watchers, 51 forks
• New committers:
– 9/10: Eric Sammer (Cloudera)
– 1/11: Bruce Mitchener (Independent)
• User characteristics
– Most potential users seem to use ad hoc scripts
– Most users are early adopters / startup devops
[Chart: GitHub watchers and forks over time, May-10 through Feb-11]
A short feature history
• 6/10: v0.9.0
– Initial open source release
• 8/10: v0.9.1
– Fixes for hangs
– Initial compression features
• 10/10: v0.9.1+29 (CDH3b3, packages)
– Added kerberized HDFS support
– Flume cookbook
– Elastic Search / Cassandra Plugins
– Initial Voldemort Plugins
• 11/10: v0.9.2
– Support for other compression codecs
– Avro RPC
– Improvements to tail and exec
– Robustness improvements
– Initial HBase / MongoDB Plugin
• 2/11: v0.9.3 (CDH3b4, packages)
– Flume Node Windows support
– Initial JSON metrics support
– Multi-master functional
– Robustness improvements
– JRuby / AMQP Plugins
– S3/EC2 Blog Stories
• 4/11: v0.9.3+xxx (CDH3 Stable, packages)
– Excessive Duplication fixes
– Compression fixes
• ?/11: v0.9.4
WHAT’S NEW?
New features
• Flume node JSON metrics
– http://node:35862/node/reports
• Terser syntax
– Before: { deco1 => { deco2 => sink } }
– After: deco1 deco2 sink
• Multiple collector sink support
    collector(30000) { [
        escapedCustomDfs("hdfs://nn1/path","prefix","format"),
        escapedCustomDfs("hdfs://nn2/path","prefix","format")
    ] }
• Limited Multi-master support
• Windows support
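The JSON metrics endpoint listed above can be polled with a few lines of Python. This is a minimal sketch, not part of Flume: the function name `fetch_node_reports` and its `opener` parameter are hypothetical, `node` is the placeholder host from the slide, and the shape of the returned JSON is not described here, so the code only parses the response and assumes no field names.

```python
import json
import urllib.request

# URL pattern taken from the slide; "node" is a placeholder host name.
REPORTS_URL = "http://node:35862/node/reports"


def fetch_node_reports(url=REPORTS_URL, opener=urllib.request.urlopen, timeout=5):
    """Fetch and parse the JSON document served at a node's reports URL.

    `opener` is injectable (hypothetical design choice) so the function
    can be exercised without a live Flume node.
    """
    with opener(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Passing a stub as `opener` keeps the sketch testable offline; against a real node, calling `fetch_node_reports()` with the host substituted in the URL would return whatever JSON the node serves.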
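To make the two syntax items above concrete, here is a small Python sketch that renders sink specification strings in the forms shown on the slide. The helper names (`nested_spec`, `terse_spec`, `multi_collector_spec`) are hypothetical and exist only for illustration; the code builds strings and does not validate them against Flume's actual grammar.

```python
def nested_spec(decorators, sink):
    """Render the nested form: { deco1 => { deco2 => sink } }."""
    spec = sink
    for deco in reversed(decorators):
        spec = "{ " + deco + " => " + spec + " }"
    return spec


def terse_spec(decorators, sink):
    """Render the terser form: deco1 deco2 sink."""
    return " ".join(list(decorators) + [sink])


def multi_collector_spec(arg, hdfs_dirs, prefix, fmt):
    """Render a collector wrapping several escapedCustomDfs sinks.

    `arg` is the numeric argument shown on the slide (30000); its meaning
    is not stated there, so it is passed through unchanged.
    """
    sinks = ", ".join(
        'escapedCustomDfs("{}","{}","{}")'.format(d, prefix, fmt) for d in hdfs_dirs
    )
    return "collector({}) {{ [ {} ] }}".format(arg, sinks)


if __name__ == "__main__":
    print(nested_spec(["deco1", "deco2"], "sink"))
    print(terse_spec(["deco1", "deco2"], "sink"))
    print(multi_collector_spec(
        30000, ["hdfs://nn1/path", "hdfs://nn2/path"], "prefix", "format"))
```

Both `nested_spec` and `terse_spec` take the same decorator list, which shows that the terser form carries the same ordering information with less punctuation.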
STORIES
Flume: The Standard Use Case
[Diagram: servers, each with an Agent (agent tier), feed Collectors (collector tier), which write to HDFS; a Master is also shown.]
Flume: Multi Datacenter
[Diagram labels: API server and Processor server groups (api, proc), each with an Agent; Collectors (collector tier); HDFS.]
Flume: Multi Datacenter
[Diagram labels: API server and Processor server groups (api, proc), each with an Agent; Collectors (collector tier); a Relay; HDFS.]
Flume: Near Realtime Aggregator
[Diagram labels: Ad svr (x4), Agent (x4), Collector, Tracker, HDFS, Hive job, DB; flows labeled reports, verify, quick reports.]
Flume: An enterprise story
[Diagram labels: API server (api); Agents on Win and Linux hosts; Collectors (collector tier); Kerberos HDFS; Active Directory / LDAP.]
Flume: An emerging community story
[Diagram labels: svr with Agents; Collector; Fanout to HDFS, HBase, and Incremental Search Idx; HDFS serves Hive query and Pig query; HBase serves Key lookup and Range query; the search index serves Search query and Faceted query.]
WHAT NEEDS WORK?
WHAT COMES NEXT?
Known issues
• Excessive event duplication (due to tail or e2e agent)
• Configuration translation problem in some cases
• Multi-master limited: doesn’t work with translations
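One plausible way an end-to-end agent can produce duplicates: if an acknowledgement is lost, the sender retries and the receiver stores the same event again. The simulation below is a hypothetical sketch of generic at-least-once delivery, not Flume's actual code; every name in it (`deliver_at_least_once`, `dedupe`, `lost_acks`) is illustrative, and the deduplication step is only one possible mitigation, not the fix proposed by the project.

```python
def deliver_at_least_once(events, lost_acks, max_retries=3):
    """Simulate a sender that resends an event until it sees an ack.

    `lost_acks` maps an event id to the number of acks lost for that
    event. Each lost ack causes one extra delivery to the receiver.
    """
    received = []
    for event_id, body in events:
        losses = lost_acks.get(event_id, 0)
        for attempt in range(max_retries + 1):
            received.append((event_id, body))  # the receiver stores the event
            if attempt >= losses:              # this ack made it back
                break
    return received


def dedupe(received):
    """Keep the first copy of each event id, preserving arrival order."""
    seen = set()
    unique = []
    for event_id, body in received:
        if event_id not in seen:
            seen.add(event_id)
            unique.append((event_id, body))
    return unique
```

With two lost acks for one event, the receiver ends up holding three copies of it; `dedupe` collapses them by event id, which only works if events carry a stable identifier.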
What’s next? (proposals)
• Fix excessive duplication issues
• Apache Incubator (?)
• Log4j/Log4net/logback/etc…
• Fix Multi-master limitations
• Security upgrades for node to node comms (TLS/SSL)
• Improved metrics / GUI / usability
• Integration with open source alerting/monitoring tools
• Integration with proprietary systems
• Version proofing RPCs / State storage
• Packaging friendly plug-in install
• Multi Datacenter Story
• Performance Increases
• Inline near-realtime analytics
• Puppet/Chef style config for nodes
• Lightweight Agent
• Masterless Agent
• Better S3 / AWS support
Q+A