Apache Kafka
CMSC 491 Hadoop-Based Distributed Computing
Spring 2016 Adam Shook
Overview
• Kafka is a “publish-subscribe messaging rethought as a distributed commit log”
• Fast
• Scalable
• Durable
• Distributed
Kafka adoption and use cases
• LinkedIn: activity streams, operational metrics, data bus
  – 400 nodes, 18k topics, 220B msg/day (peak 3.2M msg/s), May 2014
• Netflix: real-time monitoring and event processing
• Twitter: as part of their Storm real-time data pipelines
• Spotify: log delivery (from 4h down to 10s), Hadoop
• Loggly: log collection and processing
• Mozilla: telemetry data
• Airbnb, Cisco, Gnip, InfoChimps, Ooyala, Square, Uber, …
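As a rough sanity check on the LinkedIn figures above, the daily volume can be converted to an average per-second rate and compared with the quoted peak. This is plain arithmetic on the two numbers from the slide; the variable names are illustrative only.

```python
# Figures quoted on the slide (LinkedIn, May 2014).
MSGS_PER_DAY = 220e9        # 220B msg/day
PEAK_MSGS_PER_SEC = 3.2e6   # peak 3.2M msg/s

SECONDS_PER_DAY = 24 * 60 * 60

# Average rate if the daily volume were spread evenly over the day.
avg_msgs_per_sec = MSGS_PER_DAY / SECONDS_PER_DAY
# How far the quoted peak sits above that average.
peak_to_avg = PEAK_MSGS_PER_SEC / avg_msgs_per_sec

print(f"average: {avg_msgs_per_sec / 1e6:.2f}M msg/s")
print(f"peak / average: {peak_to_avg:.2f}")
```

The average works out to roughly 2.5M msg/s, so the quoted peak is only modestly above the sustained load.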
How fast is Kafka?
• “Up to 2 million writes/sec on 3 cheap machines”
  – Using 3 producers on 3 different machines, 3x async replication
    • Only 1 producer/machine because NIC already saturated
• Sustained throughput as stored data grows
  – Slightly different test config than 2M writes/sec above.
Why is Kafka so fast?
• Fast writes:
  – While Kafka persists all data to disk, essentially all writes go to the page cache of OS, i.e. RAM.
• Fast reads:
  – Very efficient to transfer data from page cache to a network socket
  – Linux: sendfile() system call
• Combination of the two = fast Kafka!
  – Example (Operations): On a Kafka cluster where the consumers are mostly caught up you will see no read activity on the disks as they will be serving data entirely from cache.
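The read path described above (page cache to network socket via sendfile()) can be illustrated with a minimal Python sketch. It is not Kafka's own code: it uses Python's `socket.sendfile()`, which delegates to `os.sendfile()` where the platform supports it and otherwise falls back to ordinary sends. The helper names `serve_file_over_socket` and `demo`, and the use of a local socket pair in place of a real consumer connection, are assumptions made for illustration.

```python
import os
import socket
import tempfile


def serve_file_over_socket(path: str, sock: socket.socket) -> int:
    """Send the whole file at `path` to `sock` and return the bytes sent."""
    with open(path, "rb") as f:
        # socket.sendfile() uses os.sendfile() where available, letting the
        # kernel copy from the page cache to the socket without passing the
        # data through a user-space buffer.
        return sock.sendfile(f)


def demo() -> bytes:
    payload = b"message-1\nmessage-2\n"
    # A freshly written file normally sits in the OS page cache.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name

    # A connected socket pair stands in for a broker-to-consumer connection.
    sender, receiver = socket.socketpair()
    try:
        sent = serve_file_over_socket(path, sender)
        sender.close()
        received = b""
        while len(received) < sent:
            chunk = receiver.recv(4096)
            if not chunk:
                break
            received += chunk
        return received
    finally:
        receiver.close()
        sender.close()
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```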
A first look
• The who is who
  – Producers write data to brokers.
  – Consumers read data from brokers.
  – All this is distributed.
• The data
  – Data is stored in topics.
  – Topics are split into partitions, which are replicated.
Topics
• Topic: feed name to which messages are published
  – Example: “zerg.hydra”
[Figure: a Kafka topic on the broker(s), drawn as a log running from older msgs to newer msgs. Producers A1, A2, …, An always append new messages to the “tail” (think: append to a file). Kafka prunes the “head” based on age or max size or “key”.]
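The append-at-tail, prune-at-head behaviour of a topic can be sketched as a toy in-memory log. This is a simplified model under stated assumptions, not how Kafka stores data: the class name `TopicLog` and its parameters `max_messages` and `max_age_s` are hypothetical, size is counted in messages rather than bytes, and key-based pruning is not modelled.

```python
import time
from collections import deque
from typing import Deque, List, Optional, Tuple


class TopicLog:
    """Toy single-partition log: append at the tail, prune at the head."""

    def __init__(self, max_messages: Optional[int] = None,
                 max_age_s: Optional[float] = None) -> None:
        # Each entry is (offset, timestamp, payload); oldest entry is leftmost.
        self._entries: Deque[Tuple[int, float, bytes]] = deque()
        self._next_offset = 0
        self.max_messages = max_messages  # hypothetical size limit (count)
        self.max_age_s = max_age_s        # hypothetical age limit (seconds)

    def append(self, payload: bytes, now: Optional[float] = None) -> int:
        """Producers only ever add at the tail; returns the new offset."""
        now = time.time() if now is None else now
        offset = self._next_offset
        self._entries.append((offset, now, payload))
        self._next_offset += 1
        return offset

    def _head_is_prunable(self, now: float) -> bool:
        _, timestamp, _ = self._entries[0]
        over_size = (self.max_messages is not None
                     and len(self._entries) > self.max_messages)
        over_age = (self.max_age_s is not None
                    and (now - timestamp) > self.max_age_s)
        return over_size or over_age

    def prune(self, now: Optional[float] = None) -> int:
        """Drop the oldest messages that exceed a limit; returns how many."""
        now = time.time() if now is None else now
        removed = 0
        while self._entries and self._head_is_prunable(now):
            self._entries.popleft()
            removed += 1
        return removed

    def offsets(self) -> List[int]:
        return [offset for offset, _, _ in self._entries]


if __name__ == "__main__":
    log = TopicLog(max_messages=3)
    for i in range(5):
        log.append(f"msg-{i}".encode(), now=float(i))
    log.prune(now=5.0)
    print(log.offsets())
```

Offsets keep increasing after pruning, so a message's offset never changes once it has been assigned.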
Topics
[Figure: the same topic log on the broker(s), running from older msgs to newer msgs. Producers A1, A2, …, An always append to the “tail” (think: append to a file). Consumer groups C1 and C2 each use an “offset pointer” to track/control their read progress (and decide the pace of consumption).]
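The idea that each consumer group keeps its own offset pointer into the same log can be sketched as follows. This is an illustrative model only: the class `ConsumerGroupCursor` and its `poll`/`seek` methods are hypothetical names, and the log is a plain Python list rather than a broker.

```python
from typing import List


class ConsumerGroupCursor:
    """Toy offset pointer of one consumer group over a shared log."""

    def __init__(self, log: List[str]) -> None:
        self._log = log   # shared with producers and other groups
        self.offset = 0   # next position this group will read

    def poll(self, max_messages: int = 1) -> List[str]:
        """Read up to max_messages and advance this group's pointer only."""
        batch = self._log[self.offset:self.offset + max_messages]
        self.offset += len(batch)
        return batch

    def seek(self, offset: int) -> None:
        """Move the pointer, e.g. back to 0 to re-read old messages."""
        if not 0 <= offset <= len(self._log):
            raise ValueError("offset outside the log")
        self.offset = offset


if __name__ == "__main__":
    log = ["m0", "m1", "m2", "m3"]
    c1 = ConsumerGroupCursor(log)
    c2 = ConsumerGroupCursor(log)
    print(c1.poll(3))   # C1 reads quickly
    print(c2.poll(1))   # C2 reads at its own pace
    c1.seek(0)          # C1 rewinds; C2 is unaffected
    print(c1.poll(1), c2.offset)
```

Because reading does not remove messages, the groups cannot interfere with each other, and each one decides its own pace.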
Partitions
• A topic consists of partitions.
• Partition: ordered + immutable sequence of messages that is continually appended to
Partitions
• # partitions of a topic is configurable
• # partitions determines max consumer (group) parallelism
  – cf. parallelism of Storm’s KafkaSpout via builder.setSpout(,,N)
  – Consumer group A, with 2 consumers, reads from a 4-partition topic
  – Consumer group B, with 4 consumers, reads from the same topic
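Why the partition count caps consumer parallelism can be seen in a small sketch that spreads partitions over the consumers of one group. The function `assign_partitions` and its round-robin rule are assumptions chosen for illustration; Kafka's actual assignment strategies may distribute partitions differently, but each partition still goes to at most one consumer per group.

```python
from typing import Dict, List


def assign_partitions(num_partitions: int,
                      consumers: List[str]) -> Dict[str, List[int]]:
    """Round-robin the partitions of a topic over ONE group's consumers."""
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    assignment: Dict[str, List[int]] = {c: [] for c in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


if __name__ == "__main__":
    # Group A: 2 consumers on a 4-partition topic, 2 partitions each.
    print(assign_partitions(4, ["a1", "a2"]))
    # Group B: 4 consumers on the same topic, 1 partition each.
    print(assign_partitions(4, ["b1", "b2", "b3", "b4"]))
    # A hypothetical 6-consumer group: two consumers get nothing to read.
    print(assign_partitions(4, ["c1", "c2", "c3", "c4", "c5", "c6"]))
```

In the last case only four consumers do any work, which is the sense in which the number of partitions is the maximum parallelism of a group.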
Partition offsets
• Offset: messages in the partitions are each assigned a unique (per partition) and sequential id called the offset
  – Consumers track their pointers via (offset, partition, topic) tuples
[Figure: consumer group C1]
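A minimal sketch of per-partition offsets and of a consumer tracking its position as (offset, partition, topic) tuples follows. The classes `PartitionedTopic` and `OffsetTracker` are hypothetical and keep everything in memory; they only illustrate that offsets are sequential within a partition and are not unique across partitions.

```python
from typing import Dict, List, Optional, Tuple


class PartitionedTopic:
    """Toy topic whose partitions are independent append-only lists."""

    def __init__(self, name: str, num_partitions: int) -> None:
        self.name = name
        self.partitions: List[List[str]] = [[] for _ in range(num_partitions)]

    def append(self, partition: int, message: str) -> int:
        """Append to one partition; the offset is its index in that partition."""
        log = self.partitions[partition]
        log.append(message)
        return len(log) - 1


class OffsetTracker:
    """Toy consumer-side bookkeeping of the next offset per partition."""

    def __init__(self) -> None:
        self._next: Dict[Tuple[str, int], int] = {}

    def read_next(self, topic: PartitionedTopic, partition: int) -> Optional[str]:
        key = (topic.name, partition)
        offset = self._next.get(key, 0)
        log = topic.partitions[partition]
        if offset >= len(log):
            return None  # caught up on this partition
        self._next[key] = offset + 1
        return log[offset]

    def pointers(self) -> List[Tuple[int, int, str]]:
        """Current pointers as (offset, partition, topic) tuples."""
        return [(offset, partition, name)
                for (name, partition), offset in sorted(self._next.items())]


if __name__ == "__main__":
    topic = PartitionedTopic("zerg.hydra", 2)
    print(topic.append(0, "a"), topic.append(0, "b"), topic.append(1, "c"))
    tracker = OffsetTracker()
    tracker.read_next(topic, 0)
    tracker.read_next(topic, 0)
    tracker.read_next(topic, 1)
    print(tracker.pointers())
```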
Replicas of a partition
• Replicas: “backups” of a partition
  – They exist solely to prevent data loss.
  – Replicas are never read from, never written to.
    • They do NOT help to increase producer or consumer parallelism!
  – Kafka tolerates (numReplicas - 1) dead brokers before losing data
    • LinkedIn: numReplicas == 2 → 1 broker can die
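The (numReplicas - 1) rule can be written down as a short sketch. The function names `tolerated_broker_failures` and `partition_survives` are hypothetical, and the model assumes every replica of a partition lives on a different broker and holds a complete copy of the data.

```python
from typing import Iterable


def tolerated_broker_failures(num_replicas: int) -> int:
    """Dead brokers a partition can tolerate before losing data."""
    if num_replicas < 1:
        raise ValueError("a partition needs at least one replica")
    return num_replicas - 1


def partition_survives(replica_brokers: Iterable[int],
                       dead_brokers: Iterable[int]) -> bool:
    """Data survives while at least one broker holding a replica is alive."""
    return bool(set(replica_brokers) - set(dead_brokers))


if __name__ == "__main__":
    # The LinkedIn setting from the slide: numReplicas == 2.
    print(tolerated_broker_failures(2))
    print(partition_survives(replica_brokers=[1, 2], dead_brokers=[1]))
    print(partition_survives(replica_brokers=[1, 2], dead_brokers=[1, 2]))
```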
Kafka Quickstart
• Steps for downloading Kafka, starting a server, and creating a console-based consumer/producer
• Requires ZooKeeper to be installed and running
• https://kafka.apache.org/documentation.html#quickstart
• https://github.com/adamjshook/hadoop-demos/tree/master/kafka