Post on 21-Jan-2018
Apache Kafka Best Practices
Manikumar Reddy
@omkreddy
2 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Apache Kafka
Core APIs
– The Producer API
– The Consumer API
– The Connector API
– The Streams API
Broad classes of applications
– Building real-time streaming data pipelines
– Building real-time streaming applications
– Core building block in other data systems
Key Concepts and Terminology
Component Layout
Hardware Guidance
Kafka Brokers
– Cluster size: 3+
– Memory: 24GB+ (for small), 64GB+ (for large)
– CPU: multi-core processors (12+ CPU cores), hyper-threading enabled
– Storage: 6+ x 1TB dedicated disks (RAID or JBOD)
ZooKeeper
– Cluster size: 3 (for small), 5 (for large)
– Memory: 8GB+ (for small), 24GB+ (for large)
– CPU: 2+ cores
– Storage: SSD for transaction logs
OS Tuning
OS page cache
– Ex: allocate enough to hold all the active segments of the log
File descriptor limits: >100k
Minimize swapping
TCP tuning
JVM configs
– Java 8 with G1 collector
– 6-8 GB heap
Kafka Disk Storage
Use multiple disk spindles, dedicated to Kafka
JBOD vs RAID10
JBOD
– Gives Kafka the full I/O of all the disks
JBOD limitations
– Any disk failure causes an unclean shutdown and requires lengthy recovery
– Data is not distributed consistently across disks
– Multiple directories
KIP-112/113
– Necessary tools for users to manage JBOD
– Intelligent partition assignment
– On disk failure, broker can serve replicas on the good disks
– Re-assign replicas between disks of the same broker
RAID
RAID10
– Can survive single disk failure
– Performance and protection
– Balances load across disks
– Single mount point
– Performance hit and reduces the usable space
File system
– EXT or XFS
– SSD
– Issues on NFS
– SAN, NAS
Basic Monitoring
CPU Load
Network Metrics
File Handle Usage
Disk Space
Disk I/O Performance
Garbage Collection
ZooKeeper Monitoring
Kafka Replication
Each partition has replicas: a leader replica and follower replicas
Leader maintains the in-sync replicas (ISR)
– replica.lag.time.max.ms, num.replica.fetchers
– min.insync.replicas – used with producer acks=all to ensure greater durability
https://www.slideshare.net/junrao/kafka-replication-apachecon2013
Under Replicated Partitions
Number of partitions which are not fully replicated within the cluster
Mbean - kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
ISR Shrink/Expand Rate
Under Replicated Partitions: possible causes
– Lost broker?
– Controller issues
– ZooKeeper issues
– Network issues
Solutions
– Tune the ISR settings
– Expand brokers
Controller
Manages the partition life cycle
Avoid controller ZK session expiry
– Soft failures: ISR churn, under-replicated partitions
– ZK server performance
– Long GC pauses on the broker
– Bad network configuration
Monitoring
– Mbean: kafka.controller:type=KafkaController,name=ActiveControllerCount
– Exactly one broker in the cluster should report 1
– LeaderElectionRate
Unclean leader election
Allows replicas not in the ISR set to be elected as leader
Availability vs correctness – by default, Kafka chooses availability
Monitoring
– Mbean: kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec
Default will be changed in next release
Broker Configs
log.retention.{ms, minutes, hours} , log.retention.bytes
message.max.bytes, replica.fetch.max.bytes
delete.topic.enable
unclean.leader.election.enable = false
min.insync.replicas = 2
replica.lag.time.max.ms, num.replica.fetchers
replica.fetch.response.max.bytes
zookeeper.session.timeout.ms = 30000 (30s)
num.io.threads
Cluster Sizing
Broker sizing
– Partition count on each broker (<2K)
– Keep partition size on disk manageable (under 25GB per partition)
Cluster size (no. of brokers)
– How much retention we need
– How much traffic the cluster is getting
Cluster expansion
– Disk usage on the log segments partition should stay under 60%
– Network usage on each broker should stay under 75%
Cluster monitoring
– Keep cluster balanced
– Ensure that partitions of a topic are fairly distributed across brokers
– Ensure that nodes in a cluster are not running out of disk and network
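The cluster expansion thresholds can be turned into a simple check. This is a minimal sketch in Python: only the 60% disk and 75% network limits come from the slide, while the function name, its parameters, and the sample broker figures are hypothetical.

```python
# Thresholds taken from the slide; everything else is illustrative.
DISK_USAGE_LIMIT = 0.60      # log-segment partition should stay under 60%
NETWORK_USAGE_LIMIT = 0.75   # broker network usage should stay under 75%


def needs_expansion(disk_used_bytes, disk_total_bytes,
                    net_used_bps, net_capacity_bps):
    """Return the resources on which a broker has reached its limit."""
    reasons = []
    if disk_used_bytes / disk_total_bytes >= DISK_USAGE_LIMIT:
        reasons.append("disk")
    if net_used_bps / net_capacity_bps >= NETWORK_USAGE_LIMIT:
        reasons.append("network")
    return reasons


# Hypothetical broker: 4 TB used of 6 TB, 300 Mbit/s on a 1 Gbit/s link.
print(needs_expansion(4e12, 6e12, 300e6, 1e9))  # ['disk']
```

An empty list means the broker is within both limits; in practice the inputs would come from whatever disk and network metrics the cluster already collects.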
Broker Monitoring
Partition counts
– Mbean: kafka.server:type=ReplicaManager,name=PartitionCount
Leader replica counts
– Mbean: kafka.server:type=ReplicaManager,name=LeaderCount
ISR shrink rate/ISR expansion rate
– Mbean: kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
– Mbean: kafka.server:type=ReplicaManager,name=IsrExpandsPerSec
Message in rate/Byte in rate/Byte out rate
NetworkProcessorAvgIdlePercent
RequestHandlerAvgIdlePercent
Topic Sizing
No. of partitions
– Have at least as many partitions as there are consumers in the largest group
– If a topic is very busy, use more partitions
– Keep partition size on disk manageable (under 25GB per partition)
– Take into account any other application requirements
– Special use cases: single partition
Keyed messages
– Create enough partitions to deal with future growth
Expanding partitions
– Whenever the size of the partition on disk is larger than the threshold
Choosing Partitions
Based on throughput requirements one can pick a rough number of partitions.
– Let P be the throughput from a producer to a single partition
– Let C be the throughput from a single partition to a consumer
– Let T be the target throughput
– Use at least max(T/P, T/C) partitions
More partitions
– More open file handles
– May increase unavailability
– May increase end-to-end latency
– More memory for clients
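The sizing rule max(T/P, T/C) is easy to check with numbers. The Python sketch below rounds the result up to a whole partition; the function name and the sample throughput figures are hypothetical, not measurements from the talk.

```python
import math


def min_partitions(target, per_partition_produce, per_partition_consume):
    """Lower bound on partition count: max(T/P, T/C), rounded up."""
    needed = max(target / per_partition_produce,
                 target / per_partition_consume)
    return math.ceil(needed)


# Hypothetical measurements in MB/s: P = 10, C = 20, T = 100.
print(min_partitions(100, 10, 20))  # 10
```

Here the producer side is the bottleneck (100/10 = 10 versus 100/20 = 5), so the topic needs at least 10 partitions. P and C should be measured in your own environment.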
Quotas
Protect from bad clients and maintain SLAs
byte-rate thresholds on produce and fetch requests
can be applied to (user, client-id), user or client-id groups.
Server delays the responses
Broker Metrics for monitoring – throttle-rate, byte-rate
replica.fetch.response.max.bytes
– Limit memory usage of replica fetch response
Limiting bandwidth usage during data migration
– kafka-reassign-partitions.sh --throttle option
Kafka Producer
Use the new Java-based clients
Test in your environment
– kafka-producer-perf-test.sh
Memory
CPU
Batch compression
Avoid large messages
– Creates more memory pressure
– Slows down the brokers
Critical Configs
batch.size
– Size-based batching
– Larger size -> higher throughput, higher latency
linger.ms
– Time-based batching
– Larger value -> higher throughput, higher latency
max.in.flight.requests.per.connection
– Better throughput, affects ordering
compression.type
– Adding more user threads can help throughput
acks
– Affects message durability
Performance tuning
If throughput < network capacity
– Add more user threads
– Increase batch size
– Add more producer instances
– Add more partitions
Latency when acks = -1
– Increase num.replica.fetchers
Cross-datacenter data transfer
– Tune socket buffer settings, OS TCP buffer settings
Producer Monitoring
batch-size-avg
compression-rate-avg
waiting-threads
buffer-available-bytes
record-queue-time-max
record-send-rate
records-per-request-avg
Kafka Consumer
Test in your environment
– kafka-consumer-perf-test.sh
Throughput issues
– Not enough partitions
– OS page cache: allocate enough to hold all the messages for your consumers for, say, 30s
– Application/processing logic
Offsets topic
– __consumer_offsets
– offsets.topic.replication.factor
– offsets.retention.minutes
– Monitor ISR, topic size
Slow offset commits
– Commit async, manual commits
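The page-cache guideline under throughput issues (hold roughly 30s of messages) reduces to a single multiplication. In the Python sketch below, the function name and the 50 MB/s produce rate are hypothetical; only the 30-second window comes from the slide.

```python
def page_cache_bytes(bytes_in_per_sec, window_seconds=30):
    """Page cache needed to keep the last window_seconds of writes in memory."""
    return bytes_in_per_sec * window_seconds


# Hypothetical broker taking 50 MiB/s of produce traffic.
needed = page_cache_bytes(50 * 1024 * 1024)
print(needed)  # 1572864000 bytes, roughly 1.5 GiB
```

This is only a lower bound for consumers that read close to the tail of the log. Consumers that fall further behind will read from disk.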
Consumer Configs
fetch.min.bytes and fetch.max.wait.ms
max.poll.interval.ms
max.poll.records
session.timeout.ms
Consumer rebalance
– check timeouts
– check processing times/logic
– GC Issues
Tune network settings
Consumer Monitoring
Whether or not the consumer is keeping up with the messages that are being produced
Consumer Lag: Difference between the end of the log and the consumer offset
Monitoring
– Metrics monitoring: records-lag-max
– bin/kafka-consumer-groups.sh
– LinkedIn’s Burrow for consumer monitoring
Decreasing lag
– Analyze consumer: GC issues, hung instance
– Add more consumer instances
– Increase the number of partitions and consumers
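Consumer lag as defined above (end of the log minus the consumer offset) can be computed per partition. The Python sketch below uses plain dictionaries with hypothetical offsets instead of a real Kafka client. Treating a partition with no committed offset as offset 0 is an assumption of this example.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus the consumer's offset."""
    return {
        # Assumption: a partition with no committed offset counts from 0.
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in log_end_offsets.items()
    }


# Hypothetical offsets for a three-partition topic.
end_offsets = {0: 1500, 1: 1200, 2: 900}
committed = {0: 1500, 1: 1100, 2: 400}

lag = consumer_lag(end_offsets, committed)
print(lag)                # {0: 0, 1: 100, 2: 500}
print(max(lag.values()))  # 500
```

The maximum across partitions corresponds to what a metric such as records-lag-max tracks. A value that keeps growing means the consumer is not keeping up.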
No data loss settings
Producer
– block.on.buffer.full=true
– retries=Integer.MAX_VALUE (retries is an int setting)
– acks=all
– max.in.flight.requests.per.connection=1
– Close the producer on shutdown
Broker
– Replication factor >= 3
– min.insync.replicas=2
– Disable unclean leader election
Consumer
– Disable auto commit (enable.auto.commit=false)
– Commit offsets only after the messages are processed
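The broker settings above determine how many replicas can be lost before acks=all producers start getting errors. The Python sketch below works through that arithmetic; the function name is hypothetical, and it relies on the fact that acks=all writes succeed only while the ISR has at least min.insync.replicas members.

```python
def tolerated_replica_failures(replication_factor, min_insync_replicas):
    """Replicas that can be lost while acks=all produce requests still succeed."""
    if min_insync_replicas > replication_factor:
        raise ValueError(
            "min.insync.replicas cannot exceed the replication factor")
    return replication_factor - min_insync_replicas


# Settings from the slide: replication factor 3, min.insync.replicas=2.
print(tolerated_replica_failures(3, 2))  # 1
```

With the recommended values, one replica can be down without blocking writes, and every acknowledged message still exists on at least two replicas.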
Authorizer - Ranger Auditing
Kafka Mirror Maker
Tool to mirror a source Kafka cluster into a target (mirror) Kafka cluster
Kafka Mirror Maker
Run multiple mirroring processes
– High fault-tolerance
– High throughput
Use the --num.streams option to specify the number of consumer threads
– Number of threads is set by num.streams
Kafka Mirror Maker
Consumer and source cluster socket buffer sizes
– High value for the socket buffer size
– Consumer's fetch size
– OS networking tuning
Source and target clusters are independent entities
– Can have different numbers of partitions
– Offsets will not be the same
– Partitioning order is preserved on a per-key basis
Create topics in target cluster
Monitor whether a mirror is keeping up
– Consumer lag
Running in secure clusters
– We recommend using SSL
– We can run MM on source cluster
Open source Operational Tools
Ambari Metrics
– https://docs.hortonworks.com/HDPDocuments/Ambari-2.4.2.0/bk_ambari-user-guide/content/grafana_kafka_dashboards.html
Removing brokers and rebalancing partitions in a cluster
– https://github.com/linkedin/kafka-tools
Consumer lag monitoring
– Burrow (https://github.com/linkedin/Burrow)
Kafka Manager
– https://github.com/yahoo/kafka-manager
Apache Kafka 0.10.2 release
Includes 15 KIPs, over 200 bug fixes and improvements
The newest Java Clients now support older brokers (0.10.0 and higher)
Separation of Internal and External traffic
Create Topic Policy
Security improvements
– Support for SASL/SCRAM mechanisms
– Dynamic JAAS configuration for Kafka clients
– Support for authentication of multiple Kafka clients in single JVM
Producer and Consumer Improvements
Connect API & Streams API improvements
Thank You
References
http://kafka.apache.org/documentation.html
https://community.hortonworks.com/articles/80813/kafka-best-practices-1.html
https://www.slideshare.net/JiangjieQin/producer-performance-tuning-for-apache-kafka-63147600
https://www.slideshare.net/ToddPalino/tuning-kafka-for-fun-and-profit
https://www.slideshare.net/JiangjieQin/no-data-loss-pipeline-with-apache-kafka-49753844
https://www.slideshare.net/ToddPalino/putting-kafka-into-overdrive
https://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/