Saturday 21 March 2020

Easy to Understand Kafka Importance

KAFKA DOCUMENTATION

Apache Kafka is a community distributed event streaming platform capable of handling trillions of events a day. Initially conceived as a messaging queue, Kafka is based on an abstraction of a distributed commit log. Since being created and open-sourced by LinkedIn in 2011, Kafka has quickly evolved from a messaging queue to a full-fledged event streaming platform.

TOPIC: 
A particular stream of Data

* Similar to a table in a database (without all the constraints)
* You can have as many topics as you want
* A topic is identified by its name
* Topics are split into partitions


  

* Each partition is ordered
* Each message within a partition gets an incremental id, called an offset

Example:
* Say you have a fleet of trucks, and each truck reports its GPS position to Kafka
* You can have a topic truck-GPS that contains the positions of all trucks
* Each truck will send a message to Kafka every 20 seconds; each message will contain the truck ID & the truck position (latitude & longitude)
* We choose to create that topic with 10 partitions (arbitrary number)
* Offsets only have a meaning for a specific partition, e.g. offset 3 in partition 0 doesn't represent the same data as offset 3 in partition 1
* Order is guaranteed only within a partition (not across partitions)
* Data is kept only for a limited time (default is one week)
* Once data is written to a partition, it can't be changed (immutability)
* Data is assigned randomly to a partition unless a key is provided (more on this later)
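
To make the truck example concrete, here is a minimal kafka-python sketch. The topic name truck-GPS and the truck_id field come from the example above; the broker address, the coordinates and the helper function are assumptions for illustration only.

import json
from kafka import KafkaProducer

# Broker address is an assumption; point it at your own cluster.
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    key_serializer=lambda k: k.encode('utf-8'),
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

def report_position(truck_id, latitude, longitude):
    # Using truck_id as the key means every position update of a given
    # truck lands in the same partition, so its order is preserved.
    producer.send('truck-GPS',
                  key=truck_id,
                  value={'truck_id': truck_id, 'lat': latitude, 'lon': longitude})

report_position('truck-123', 17.38, 78.48)
producer.flush()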

BROKERS:
* A Kafka cluster is composed of multiple brokers (servers)
* Each broker is identified by its ID (integer)
* Each broker contains certain topic partitions
* After connecting to any broker (called a bootstrap broker), you will be connected to the entire cluster
* A good number to get started is 3 brokers, but some big clusters have over 100 brokers
* In these examples we choose to number brokers starting at 100 (arbitrary)

[Diagram: a Kafka cluster with three brokers: Broker 101, Broker 102 & Broker 103]

BROKER & TOPICS
* Example of Topic-A with 3 partitions
* Example of Topic-B with 2 partitions

[Diagram: Topic-A's 3 partitions spread across Brokers 101, 102 & 103; Topic-B's 2 partitions on Brokers 101 & 102]

* NOTE: Data is distributed, & Broker 103 doesn't have any Topic-B data

Topic Replication Factor
* Topics should have a replication factor > 1 (usually between 2 & 3)
* This way if a broker is down, another broker can serve the data
* Example: Topic-A with 2 partitions & a replication factor of 2
* Example: we lose Broker 102
* Result: Brokers 101 & 103 can still serve the data
* At any time only ONE broker can be the leader for a given partition
* Only that leader can receive & serve data for a partition
* The other brokers will synchronize the data
* Therefore each partition has one leader & multiple ISRs (in-sync replicas)

PRODUCERS
* Producers write data to topics (which are made of partitions)
* Producers automatically know which broker & partition to write to
* In case of broker failures, producers will automatically recover
* Producers can choose to receive acknowledgment of data writes (see the sketch below):
* acks=0: producer doesn't wait for an acknowledgment (possible data loss)
* acks=1: producer waits for the leader acknowledgment (limited data loss)
* acks=all: producer waits for leader + replicas acknowledgment (no data loss)
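
As a quick illustration, these map directly onto the kafka-python producer's acks setting. A minimal sketch, assuming a local broker and the first_topic topic used later in these notes:

from kafka import KafkaProducer

# acks='all' makes the leader wait for the in-sync replicas
# before acknowledging the write (strongest durability).
producer = KafkaProducer(bootstrap_servers=['localhost:9092'], acks='all')
producer.send('first_topic', b'durably acknowledged message')
producer.flush()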

Producers: Message keys
* Producers can choose to send a key with the message (string, number, etc.)
* If key=null, data is sent round-robin (broker 101, then 102, then 103...)
* If a key is sent, then all messages for that key will always go to the same partition
* A key is basically sent if you need message ordering for a specific field (e.g. truck_id)

CONSUMERS & CONSUMER GROUPS

CONSUMERS
* Consumers read data from a topic (identified by name)
* Consumers know which broker to read from
* In case of broker failures, consumers know how to recover
* Data is read in order within each partition

CONSUMER GROUP
* Consumers read data in consumer groups
* Each consumer within a group reads from exclusive partitions
* If you have more consumers than partitions, some consumers will be inactive

CONSUMER OFFSETS
* Kafka stores the offsets at which a consumer group has been reading
* The committed offsets live in a Kafka topic named __consumer_offsets
* When a consumer in a group has processed data received from Kafka, it should commit the offsets
* If a consumer dies, it will be able to read back from where it left off, thanks to the committed consumer offsets

Delivery Semantics for Consumers
*Consumers choose when to commit offsets
*There are 3 delivery semantics:
            *At most once:
            >Offsets are committed as soon as the message is received
            >If the processing goes wrong, the message will be lost (it won't be read again)
            *At least once (usually preferred):
            >Offsets are committed after the message is processed
            >If the processing goes wrong, the message will be read again
            >This can result in duplicate processing of messages, so make sure your processing is idempotent (i.e. processing the messages again won't impact your systems)
            *Exactly once:
            >Can be achieved for Kafka => Kafka workflows using the Kafka Streams API
            >For Kafka => external system workflows, use an idempotent consumer
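
To see how these choices show up in client configuration, here is a small kafka-python sketch; the broker address, topic and group id are placeholders:

from kafka import KafkaConsumer

# "At most once"-leaning setup: offsets are auto-committed shortly after poll,
# so a crash in the middle of processing can lose messages.
auto_commit_consumer = KafkaConsumer('first_topic',
                                     bootstrap_servers=['localhost:9092'],
                                     group_id='my-group',
                                     enable_auto_commit=True)

# "At least once" setup: disable auto-commit and call consumer.commit()
# only after processing succeeds, so failures lead to re-reads, not loss.
manual_commit_consumer = KafkaConsumer('first_topic',
                                       bootstrap_servers=['localhost:9092'],
                                       group_id='my-group',
                                       enable_auto_commit=False)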

 Kafka Broker Discovery
·     Every Kafka broker is also called a "bootstrap server"
·     That means you only need to connect to one broker, and you will be connected to the entire cluster
·     Each broker knows about all brokers, topics & partitions (metadata)




ZOOKEEPER:
·     Zookeeper manages brokers (keeps a list of them)
·     Zookeeper helps in performing leader election for partitions
·     Zookeeper sends notifications to Kafka in case of changes (e.g. new topic, broker dies, broker comes up, delete topic, etc.)
·     Kafka can't work without Zookeeper
·     Zookeeper by design operates with an odd number of servers (3, 5, 7)
·     Zookeeper has a leader (handles writes); the rest of the servers are followers
·     Zookeeper does not store consumer offsets with Kafka > v0.10

KAFKA GUARANTEES:
·     Messages are appended to a topic-partition in the order they are sent.
·     Consumers read messages in the order stored in a topic-partition.
·     With a replication factor of N, producers & consumers can tolerate up to N-1 brokers being down
·     This is why a replication factor of 3 is a good idea.
·     Allows for one broker to be taken down for maintenance.
·     Allows for another broker to be taken down unexpectedly.
·     As long as the number of partitions remains constant for a topic (no new partitions), the same key will always go to the same partition

COMMANDS TO START KAFKA
#Start the Zookeeper server
zookeeper-server-start.sh config/zookeeper.properties

#Start the Kafka broker
kafka-server-start.sh config/server.properties

#Create Topic
kafka-topics.sh --zookeeper 127.0.0.1:2181 --topic third-topic --create --partitions 3 --replication-factor 1

#To check whether the kafka topic was created or not
kafka-topics.sh --zookeeper 127.0.0.1:2181 --list

#To check the details of a topic
kafka-topics.sh --zookeeper 127.0.0.1:2181 --topic first_topic --describe

#Delete topic
kafka-topics.sh --zookeeper 127.0.0.1:2181 --topic second_topic --delete

#kafka console producer CLI (running it without arguments shows the option list)
kafka-console-producer.sh

#Produce messages with the console producer
kafka-console-producer.sh --broker-list 127.0.0.1:9092 --topic first_topic

>Hello
>How are You


#Set a producer property

kafka-console-producer.sh --broker-list 127.0.0.1:9092 --topic first_topic --producer-property acks=all

#If you produce to a topic name that doesn't exist, a new topic with that name is created automatically

kafka-console-producer.sh --broker-list 127.0.0.1:9092 --topic new_topic
>hey this topic does not exist!
    WARN[producer clientID = console-producer
>another message


#check the list of kafka topics
kafka-topics.sh --zookeeper 127.0.0.1:2181 --list

#To change the default number of partitions for auto-created topics,
#set num.partitions (e.g. num.partitions=3) in config/server.properties:
nano config/server.properties

#kafka console consumer CLI

kafka-console-consumer.sh

kafka-console-consumer.sh --bootstrap-server 127.0.0.1:9092 --topic first_topic

kafka-console-consumer.sh --bootstrap-server 127.0.0.1:9092 --topic first_topic --from-beginning

#kafka consumers in a group (from the IDE)
kafka-console-consumer.sh --bootstrap-server 127.0.0.1:9092 --topic first_topic --group my-first-application

$kafka-consumer-groups.sh

$kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list

my-first-application
console-consumer-10824
my-second-application
console-consumer-1052

#Resetting offsets
Reset options: --to-datetime, --by-duration, --to-earliest, --to-latest, --shift-by, --from-file, --to-current

$kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group my-first-application --reset-offsets --to-earliest --execute --topic first_topic

$kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-first-application




#Shift-by (shifts the committed offsets; a negative value goes back, here by 2 messages per partition)

kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group my-first-application --reset-offsets --shift-by -2 --execute --topic first_topic


Kafka Producer:

import json
import logging
import msgpack  # third-party package: pip install msgpack

from kafka import KafkaProducer
from kafka.errors import KafkaError

log = logging.getLogger(__name__)

producer = KafkaProducer(bootstrap_servers=['broker1:1234'])

# Asynchronous by default 
future = producer.send('my-topic', b'raw_bytes')

# Block for 'synchronous' sends
try:
            record_metadata = future.get(timeout=10)
except KafkaError:
            # Decide what to do if produce request failed...
            log.exception('produce request failed')
            pass

# Successful result returns assigned partition and offset
print (record_metadata.topic)
print (record_metadata.partition)
print (record_metadata.offset)

# produce keyed messages to enable hashed partitioning
producer.send('my-topic', key=b'foo', value=b'bar')

# encode objects via msgpack
producer = KafkaProducer(value_serializer=msgpack.dumps)
producer.send('msgpack-topic', {'key': 'value'})

# produce json messages
producer = KafkaProducer(value_serializer=lambda m: json.dumps(m).encode('ascii'))
producer.send('json-topic', {'key': 'value'})

# produce asynchronously
for _ in range(100):
            producer.send('my-topic', b'msg')
def on_send_success(record_metadata):
            print(record_metadata.topic)
            print(record_metadata.partition)
            print(record_metadata.offset)

def on_send_error(excp):
            log.error('I am an errback', exc_info=excp)
            # handle exception

# produce asynchronously with callbacks
producer.send('my-topic', b'raw_bytes').add_callback(on_send_success).add_errback(on_send_error)

# block until all async messages are sent
producer.flush()

# configure multiple retries
producer = KafkaProducer(retries=5)


Kafka Consumer:

import json
import msgpack  # third-party package: pip install msgpack

from kafka import KafkaConsumer

# To consume latest messages and auto-commit offsets
consumer = KafkaConsumer('my-topic',
                         group_id='my-group',
                         bootstrap_servers=['localhost:9092'])
for message in consumer:
   # message value and key are raw bytes -- decode if necessary!
   # e.g., for unicode: `message.value.decode('utf-8')`
   print ("%s:%d:%d: key=%s value=%s" % (message.topic, message.partition,
                                          message.offset, message.key,
                                          message.value))

# consume earliest available messages, don't commit offsets
KafkaConsumer(auto_offset_reset='earliest', enable_auto_commit=False)

# consume json messages
KafkaConsumer(value_deserializer=lambda m: json.loads(m.decode('ascii')))

# consume msgpack
KafkaConsumer(value_deserializer=msgpack.unpackb)

# StopIteration if no message after 1sec
KafkaConsumer(consumer_timeout_ms=1000)

# Subscribe to a regex topic pattern
consumer = KafkaConsumer()
consumer.subscribe(pattern='^awesome.*')

# Use multiple consumers in parallel w/ 0.9 kafka brokers
# typically you would run each on a different server / process / CPU
consumer1 = KafkaConsumer('my-topic',
                          group_id='my-group',
                          bootstrap_servers='my.server.com')
consumer2 = KafkaConsumer('my-topic',
                          group_id='my-group',
                          bootstrap_servers='my.server.com')


#acks & min.insync.replicas

Producer Acks Deep Dive

acks=0 (no acks)
·     No response is requested
·     If the broker goes offline or an exception happens, we won't know & will lose data
·     Useful for data where it's okay to potentially lose messages:
            #Metrics collection
            #Log collection

acks=1 (leader acks)
·     Leader response is requested, but replication is not a guarantee (it happens in the background)
·     If an ack is not received, the producer may retry
·     If the leader broker goes offline but replicas haven't replicated the data yet, we have data loss

acks=all (replicas acks)
·     Leader + replicas acks requested


#Retries & max.in.flight.requests.per.connection

Producer retries
            In case of transient failures, developers are expected to handle exceptions, otherwise the data will be lost
            >Example of a transient failure: NotEnoughReplicasException
There is a "retries" setting
            >defaults to 0
            >you can increase it to a high number, e.g. Integer.MAX_VALUE
·     In case of retries, by default, there is a chance that messages will be sent out of order (if a batch has failed to be sent)
·     If you rely on key-based ordering, that can be an issue
·     For this, you can set the setting which controls how many produce requests can be made in parallel: max.in.flight.requests.per.connection
            .Default: 5
            .Set it to 1 if you need to ensure ordering (may impact throughput)
·     In Kafka >= 1.0.0, there's a better solution!

#Idempotent Producer
*Here’s the problem: the producer can introduce duplicate messages in kafka due to network errors.

*In Kafka >=0.11 you can define an “idempotent producer” which won’t introduce duplicates on network error

*idempotent producers are great to guarantee a stable & safe pipeline!

*They come with:
            #retries = Integer.MAX_VALUE (2^31 - 1 = 2147483647)
            #max.in.flight.requests = 1 (Kafka >= 0.11 & < 1.1) or
            #max.in.flight.requests = 5 (Kafka >= 1.1 - higher performance)
            #acks = all
            #Just set:
                        producerProps.put("enable.idempotence", true);

Safe Producer Summary & Demo

Kafka < 0.11
            .acks=all (producer level)
            .Ensures data is properly replicated before an ack is received
            .min.insync.replicas=2 (broker/topic level)
            .Ensures at least two brokers in the ISR have the data after an ack
            .retries=MAX_INT (producer level)
            .Ensures transient errors are retried indefinitely
            .max.in.flight.requests.per.connection=1 (producer level)
            .Ensures only one request is tried at any time, preventing message re-ordering in case of retries

Kafka >= 0.11
            .enable.idempotence=true (producer level) + min.insync.replicas=2 (broker/topic level)
            .Implies acks=all, retries=MAX_INT, max.in.flight.requests.per.connection=5 (default)
            .while keeping ordering guarantees & improving performance!
            .Running a "safe producer" might impact throughput & latency, always test for your use case
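
As a rough kafka-python sketch of the producer-side "safe" settings for Kafka < 0.11 (the broker address is an assumption; min.insync.replicas is a broker/topic setting rather than a producer parameter, and kafka-python, at least in the versions these notes assume, does not expose an idempotence flag, so only the pre-0.11 style knobs are shown):

from kafka import KafkaProducer

safe_producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],     # assumption: local broker
    acks='all',                               # wait for leader + in-sync replicas
    retries=2147483647,                       # retry transient errors "indefinitely"
    max_in_flight_requests_per_connection=1,  # preserve ordering across retries
)
# min.insync.replicas=2 has to be set on the broker or per topic (e.g. at topic
# creation time); it is not something the producer object can configure.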

PRODUCER COMPRESSION
Message Compression
*Producers usually send data that is text-based, for example JSON data
*In this case, it is important to apply compression to the producer
*Compression is enabled at the producer level & doesn't require any configuration change in the brokers or in the consumers
*"compression.type" can be 'none' (default), 'gzip', 'snappy'
*Compression is more effective the bigger the batch of messages being sent to Kafka
*The compressed batch has the following advantages:
*Much smaller producer request size (compression ratio up to 4x!)
*Faster to transfer data over the network => less latency
*Better throughput
*Better disk utilisation in Kafka (stored messages on disk are smaller)

Disadvantages (very minor):
*Producers must commit some CPU cycles to compression
*Consumers must commit some CPU cycles to decompression

*Overall:
            *Consider testing snappy for an optimal speed/compression ratio

Message Compression Recommendations
*Find a compression algorithm that gives you the best performance for your specific data; test all of them!
*Always use compression in production, especially if you have high throughput
*Consider tweaking linger.ms & batch.size to have bigger batches & therefore more compression & higher throughput


Linger.ms & batch.size
.By default, Kafka tries to send records as soon as possible
            >It will have up to 5 requests in flight, meaning up to 5 messages individually sent at the same time
            >After this, if more messages have to be sent while others are in flight, Kafka is smart & will start batching them while they wait, to send them all at once
            >This smart batching allows Kafka to increase throughput while maintaining very low latency
            >Batches have a higher compression ratio, so better efficiency
            >So how can we control the batching mechanism?
*linger.ms: number of milliseconds a producer is willing to wait before sending a batch out (default 0)
*By introducing some lag (for example linger.ms=5), we increase the chances of messages being sent together in a batch
*So at the expense of introducing a small delay, we can increase the throughput, compression & efficiency of our producer
*If a batch is full (see batch.size) before the end of the linger.ms period, it will be sent to Kafka right away!

Batch Size
*batch.size: maximum number of bytes that will be included in a batch. The default is 16 KB
*Increasing the batch size to something like 32 KB or 64 KB can help increase the compression, throughput & efficiency of requests
*Any message that is bigger than the batch size will not be batched
*A batch is allocated per partition, so make sure you don't set it to a number that's too high, otherwise you'll waste memory!
*(Note: you can monitor the average batch size metric using Kafka Producer Metrics)

High Throughput Producer demo 
*We'll add snappy message compression in our producer
*Snappy is very helpful if your messages are text-based, for example log lines or JSON documents
*Snappy has a good balance of CPU / compression ratio
*We'll also increase the batch.size to 32 KB & introduce a small delay through linger.ms (20 ms)

High throughput at the expense of a bit of latency & CPU usage:
properties.setProperty(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");
properties.setProperty(ProducerConfig.LINGER_MS_CONFIG, "20");
properties.setProperty(ProducerConfig.BATCH_SIZE_CONFIG, Integer.toString(32*1024)); // 32 KB batch size
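
For reference, the same high-throughput settings in kafka-python look like this; a sketch, assuming a local broker and the python-snappy package for snappy support:

from kafka import KafkaProducer

high_throughput_producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],  # assumption: local broker
    compression_type='snappy',             # needs the python-snappy package installed
    linger_ms=20,                          # wait up to 20 ms so batches can fill up
    batch_size=32 * 1024,                  # 32 KB batches
)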

Producer Default Partitioner & how keys are hashed
*By default, your keys are hashed using the "murmur2" algorithm
*It is most likely preferred to not override the behavior of the partitioner, but it is possible to do so (partitioner.class)
*The formula is:
targetPartition = Utils.abs(Utils.murmur2(record.key())) % numPartitions;
*This means that the same key will go to the same partition (we already know this), & adding partitions to a topic will completely alter the formula

max.block.ms & buffer.memory
*If the producer produces faster than the broker can take, the records will be buffered in memory
*buffer.memory=33554432 (32 MB): the size of the send buffer
*That buffer will fill up over time & drain back down when the throughput to the broker increases
*If that buffer is full (all 32 MB), then the .send() method will start to block (it won't return right away)
*max.block.ms=60000: the time .send() will block until throwing an exception. Exceptions are basically thrown when:
            >The producer has filled up its buffer
            >The broker is not accepting any new data
            >60 seconds has elapsed
*If you hit that exception, it usually means your brokers are down or overloaded, as they can't respond to requests
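
In kafka-python these correspond to the buffer_memory and max_block_ms constructor parameters; a minimal sketch, with the broker address assumed:

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],  # assumption: local broker
    buffer_memory=33554432,                # 32 MB send buffer (the default)
    max_block_ms=60000,                    # block .send() for up to 60 s before raising
)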



Consumer poll Behavior     
.Kafka consumers have a "poll" model, while many other messaging buses in enterprises have a "push" model
.This allows consumers to control where in the log they want to consume, how fast, & gives them the ability to replay events

Consumer poll Behaviour
*fetch.min.bytes (default 1):
            >Controls how much data you want to pull at least on each request
            >Helps improve throughput & decrease the number of requests
            >At the cost of latency
*max.poll.records (default 500):
            >Controls how many records to receive per poll request
            >Increase if your messages are very small & you have a lot of available RAM
            >Good to monitor how many records are polled per request
*max.partition.fetch.bytes (default 1 MB):
            >Maximum data returned by the broker per partition
            >If you read from 100 partitions, you'll need a lot of memory (RAM)
*fetch.max.bytes (default 50 MB):
            >Maximum data returned for each fetch request (covers multiple partitions)
*Change these settings only if your consumer already maxes out on throughput (see the sketch below for how they map to kafka-python)
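
A kafka-python sketch of the same knobs, which appear as snake_case constructor parameters; the broker address and topic are assumptions:

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'first_topic',
    bootstrap_servers=['localhost:9092'],   # assumption: local broker
    fetch_min_bytes=1,                      # minimum data the broker should return per fetch
    max_poll_records=500,                   # records handed back per poll()
    max_partition_fetch_bytes=1024 * 1024,  # 1 MB per partition
    fetch_max_bytes=50 * 1024 * 1024,       # 50 MB per fetch request
)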

Consumer offset commits strategies
*There are two common patterns for committing offsets in a consumer application
*2 strategies:
            >(easy) enable.auto.commit=true & synchronous processing of batches
            >(medium) enable.auto.commit=false & manual commit of offsets

enable.auto.commit=true & synchronous processing of batches:

while (true) {
    ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(100));
    doSomethingSynchronous(batch);
}

*With auto-commit, offsets will be committed automatically for you at a regular interval
(auto.commit.interval.ms=5000 by default) every time you call .poll()
*If you don't use synchronous processing, you will be in "at-most-once" behavior because offsets will be committed before your data is processed

enable.auto.commit=false & synchronous processing of batches:

List<ConsumerRecord<String, String>> batch = new ArrayList<>();
while (true) {
    consumer.poll(Duration.ofMillis(100)).forEach(batch::add);
    if (isReady(batch)) {
        doSomethingSynchronous(batch);
        consumer.commitSync();
        batch.clear();
    }
}

*You control when you commit offsets & what the condition is for committing them
*Example: accumulating records into a buffer & then flushing the buffer to a database + committing the offsets then (a Python version of this pattern is sketched below)
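
A rough kafka-python version of the manual-commit (at-least-once) strategy; the topic, group id and processing function are placeholders:

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'first_topic',
    bootstrap_servers=['localhost:9092'],  # assumption: local broker
    group_id='my-first-application',
    enable_auto_commit=False,              # we commit manually below
)

while True:
    batch = consumer.poll(timeout_ms=100)  # dict of {TopicPartition: [records]}
    for tp, records in batch.items():
        for record in records:
            do_something_synchronous(record)  # hypothetical processing function
    if batch:
        consumer.commit()                  # synchronous commit, after processing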

Consumer offset Reset Behaviour
*The behavior for the consumer is then to use:
            >auto.offset.reset=latest: will read from the end of the log
            >auto.offset.reset=earliest: will read from the start of the log
            >auto.offset.reset=none: will throw an exception if no offset is found
*Additionally, consumer offsets can be lost:
            >If a consumer hasn't read new data in 1 day (Kafka < 2.0)
            >If a consumer hasn't read new data in 7 days (Kafka >= 2.0)
*This can be controlled by the broker setting offsets.retention.minutes
*To replay data for a consumer group:
            >Take all the consumers from a specific group down
            >Use the kafka-consumer-groups command to set the offsets to what you want
            >Restart the consumers

Bottom line
*Set a proper data retention period & a proper offset retention period
*Ensure the auto offset reset behavior is the one you expect/want

kafka-consumer-groups.sh --bootstrap-server 127.0.0.1:9092 --group kafka-demo-elasticsearch --reset-offsets --to-earliest --execute --topic manju1

Consumer Heartbeat Thread
*Heartbeats are sent periodically to the broker
*session.timeout.ms (default 10 seconds):
            >If no heartbeat is sent during that period, the consumer is considered dead
            >Set it even lower for faster consumer rebalances
*heartbeat.interval.ms (default 3 seconds):
            >How often to send heartbeats
            >Usually set to 1/3rd of session.timeout.ms
*Take-away: this mechanism is used to detect a consumer application being down

Consumer Poll Thread
*max.poll.interval.ms (default 5 minutes):
            *Maximum amount of time between two .poll() calls before declaring the consumer dead
            *This is particularly relevant for big data frameworks like Spark, in case the processing takes time
*Take-away: this mechanism is used to detect a data processing issue with the consumer (see the sketch below)
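
These liveness settings map to kafka-python consumer parameters as well; a minimal sketch, with the values being the defaults described above and the broker address assumed:

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'first_topic',
    bootstrap_servers=['localhost:9092'],  # assumption: local broker
    session_timeout_ms=10000,              # considered dead if no heartbeat within 10 s
    heartbeat_interval_ms=3000,            # send a heartbeat every 3 s (~1/3 of the session timeout)
    max_poll_interval_ms=300000,           # allow up to 5 minutes between poll() calls
)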