It's all about Apache Kafka

You might have heard about Apache Kafka. Let's dig into it and explore why you should know about it and what it brings to the table.

Let's start with the ABCs of it. Take a sip of coffee and let's roll.

In a typical application, we have a source system, a target system, and data to be transferred between them. So we build an integration between them. So far, so good!


But we may have multiple sources and multiple targets, and that's where the problem arises: with, say, 2 sources and 4 targets, we have to maintain 2 × 4 = 8 integrations. For each of those integrations you have to care about the protocol, data schema, data format, and so on, and every additional target adds load on the source systems.

We need a distributed messaging system to solve this, and that's where Apache Kafka comes in: all sources and targets become decoupled from each other.
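To put numbers on the integration problem above: with direct point-to-point integrations the count grows multiplicatively, while with a broker in the middle each system connects exactly once. A quick sketch (the 2-source / 4-target counts are just the example from the text):

```python
def point_to_point(sources: int, targets: int) -> int:
    """Each source needs a dedicated integration with each target."""
    return sources * targets

def via_broker(sources: int, targets: int) -> int:
    """Each system integrates once, with the broker in the middle."""
    return sources + targets

print(point_to_point(2, 4))  # 8 integrations to build and maintain
print(via_broker(2, 4))      # 6 connections, growing additively, not multiplicatively
```

Add a fifth target and the point-to-point count jumps by 2 (one per source), while the brokered count grows by just 1.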

Apache Kafka was developed at LinkedIn and later donated to the Apache Software Foundation. It's written in Scala and Java, and it's designed to provide a unified, low-latency, high-throughput platform for handling real-time data feeds.

Kafka is so powerful that LinkedIn itself, along with Netflix, Uber, Airbnb, Cisco, Walmart, and many more (2,000+ companies, including 35% of the Fortune 500), uses it extensively.

  • It can scale to hundreds of brokers,
  • It scales horizontally,
  • It handles millions of messages per second.

Performance: latency is under 10 ms, which is near real-time.

Use Cases:
  • Messaging Systems,
  • Activity tracking,
  • Metrics,
  • Stream processing,
  • Integration with other systems like Spark, Flink, Storm, Hadoop, etc.
Real-time examples:
  • Netflix uses Kafka to apply recommendations in real time while you're in the middle of watching something,
  • Uber uses Kafka to gather user, taxi, and trip data and to allocate drivers in real time,
  • LinkedIn uses Kafka to power connection recommendations and spam filtering in real time.
Now, what are the other options, and what difference does Kafka make? 

Kafka's APIs allow producers to publish data streams to topics. A topic is a partitioned log of records, with each partition being ordered and immutable. Consumers subscribe to topics. Kafka runs on a cluster of brokers, with partitions split across cluster nodes; as a result, Kafka is highly scalable. However, Kafka requires extra effort from the user to configure and scale it to their requirements.
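That data model can be sketched in a few lines. This is a toy in-memory illustration, not the real client API: records with the same key land in the same partition, each partition is an append-only ordered log, and a record's position in its partition is its offset.

```python
import hashlib

class Topic:
    """Toy model of a Kafka topic: a fixed set of append-only partitions."""

    def __init__(self, name: str, num_partitions: int = 3):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def _partition_for(self, key: str) -> int:
        # Hash the key so the same key always maps to the same partition,
        # which is what gives per-key ordering.
        digest = hashlib.md5(key.encode()).digest()
        return int.from_bytes(digest[:4], "big") % len(self.partitions)

    def produce(self, key: str, value: str) -> tuple[int, int]:
        """Append a record; return (partition, offset), like a broker ack."""
        p = self._partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1

    def consume(self, partition: int, offset: int = 0) -> list:
        """Read one partition from a given offset onward, in order."""
        return self.partitions[partition][offset:]

topic = Topic("user-events")
p1, o1 = topic.produce("user-42", "login")
p2, o2 = topic.produce("user-42", "view-page")
assert p1 == p2      # same key -> same partition
assert o2 == o1 + 1  # offsets within a partition are strictly ordered
```

Ordering is only guaranteed within a partition, which is exactly why keyed records matter: events for the same key are replayed in the order they were produced.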

Amazon Kinesis is a cloud-based real-time processing service. Kinesis producers can push data to the stream as soon as it is created. Kinesis breaks the stream into shards (similar to partitions), determined by your partition key. Each shard has a hard limit on the number of transactions and the data volume per second; if you exceed this limit, you need to increase your number of shards. Much of the maintenance and configuration is hidden from the user, and AWS makes scaling easy, with users paying only for what they use.
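Those per-shard limits mean capacity planning is on you: given an expected write rate, you can work out the minimum shard count. A back-of-the-envelope sketch using the commonly quoted per-shard ingest limits (1 MB/s or 1,000 records/s):

```python
import math

SHARD_MB_PER_SEC = 1.0        # per-shard ingest limit by volume
SHARD_RECORDS_PER_SEC = 1000  # per-shard ingest limit by record count

def min_shards(records_per_sec: float, avg_record_kb: float) -> int:
    """Smallest shard count that satisfies both per-shard ingest limits."""
    by_records = records_per_sec / SHARD_RECORDS_PER_SEC
    by_volume = (records_per_sec * avg_record_kb / 1024) / SHARD_MB_PER_SEC
    return max(1, math.ceil(max(by_records, by_volume)))

# 5,000 records/s of ~2 KB each is ~9.8 MB/s, so volume is the binding limit:
print(min_shards(5000, 2))  # 10
```

Whichever constraint (records or bytes) is hit first determines the shard count, which is why average record size matters as much as message rate.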

Microsoft Azure Event Hubs describes itself as an event ingestor capable of receiving and processing millions of events per second. Producers send events to an event hub via AMQP or HTTPS. Event Hubs also has the concept of partitions, enabling specific consumers to receive a subset of the stream. Consumers connect via AMQP. Consumer groups give consuming applications their own separate view of the event stream. Event Hubs is a fully managed service, but users must pre-purchase capacity in the form of throughput units.

Google Pub/Sub offers scalable cloud-based messaging. Publisher applications send messages to a topic, and consumers subscribe to a topic. Messages are persisted in a message store until they are acknowledged. Publishers and pull-subscribers are applications that can make Google API HTTPS requests. Scaling is automatic, with load distributed across data centres. Users are charged by data volume.
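That "persisted until acknowledged" model can be sketched as follows. This is a toy illustration, not the Pub/Sub API: each subscription tracks its own unacknowledged messages, unacked messages are redelivered (hence at-least-once), and a message is only dropped from the store once every subscription has acked it.

```python
class MiniPubSub:
    """Toy topic with per-subscription acknowledgement tracking."""

    def __init__(self):
        self.messages = {}  # msg_id -> payload, retained until fully acked
        self.pending = {}   # subscription name -> set of unacked msg_ids
        self.next_id = 0

    def subscribe(self, name: str) -> None:
        self.pending[name] = set()

    def publish(self, payload: str) -> int:
        msg_id = self.next_id
        self.next_id += 1
        self.messages[msg_id] = payload
        for unacked in self.pending.values():
            unacked.add(msg_id)  # fan out to every subscription
        return msg_id

    def pull(self, name: str) -> list:
        """Deliver unacked messages; redelivered until acked (at-least-once)."""
        return [(m, self.messages[m]) for m in sorted(self.pending[name])]

    def ack(self, name: str, msg_id: int) -> None:
        self.pending[name].discard(msg_id)
        # Retained until *all* subscriptions have acknowledged it.
        if all(msg_id not in unacked for unacked in self.pending.values()):
            self.messages.pop(msg_id, None)

bus = MiniPubSub()
bus.subscribe("billing")
bus.subscribe("audit")
mid = bus.publish("order-created")
bus.ack("billing", mid)
assert mid in bus.messages      # audit hasn't acked yet, so it's retained
bus.ack("audit", mid)
assert mid not in bus.messages  # acked everywhere -> dropped from the store
```

Contrast this with Kafka's model, where retention is time- or size-based and independent of whether any consumer has read the message.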

The following comparison (Kafka vs. Amazon Kinesis vs. Microsoft Event Hubs vs. Google Pub/Sub) covers some of the differences you can consider while choosing one of them.

Messaging guarantees:
  • Kafka: At least once per normal connector; precisely once with the Spark direct connector.
  • Kinesis: At least once, unless you build deduplication or idempotency into the consumers.
  • Event Hubs: At least once, but allows consumer-managed checkpoints for exactly-once reads.
  • Pub/Sub: At least once.

Ordering guarantees:
  • Kafka: Guaranteed within a partition.
  • Kinesis: Guaranteed within a shard.
  • Event Hubs: Guaranteed within a partition.
  • Pub/Sub: No ordering guarantees.

Throughput:
  • Kafka: No quoted throughput figures; one study showed ~30,000 messages/sec.
  • Kinesis: One shard supports 1 MB/s input, 2 MB/s output, or 1,000 records per second; one study showed ~20,000 messages/sec.
  • Event Hubs: Scaled in throughput units, each supporting 1 MB/s ingress, 2 MB/s egress, or 84 GB of storage; the standard tier allows 20 throughput units.
  • Pub/Sub: Default is 100 MB/s in and 200 MB/s out, but the maximum is quoted as unlimited.

Configurable persistence period:
  • Kafka: No maximum.
  • Kinesis: 1 to 7 days (default is 24 hours).
  • Event Hubs: 1 to 7 days (default is 24 hours).
  • Pub/Sub: 7 days (not configurable), or until acknowledged by all subscribers.

Partitioning:
  • Kafka: Yes.
  • Kinesis: Yes (shards).
  • Event Hubs: Yes.
  • Pub/Sub: Yes, but not under user control.

Consumer groups:
  • Kafka: Yes.
  • Kinesis: Yes (called auto-scaling groups).
  • Event Hubs: Yes (up to 20 for the standard pricing tier).
  • Pub/Sub: Yes (called subscriptions).

Disaster recovery, with cross-region replication:
  • Kafka: Yes (cluster mirroring).
  • Kinesis: Automatic, across 3 zones.
  • Event Hubs: Yes (for the standard tier).
  • Pub/Sub: Yes (automatic).

Maximum size of each data blob:
  • Kafka: Default 1 MB (but can be configured).
  • Kinesis: 1 MB.
  • Event Hubs: Default 256 KB (paid plans allow up to 1 MB).
  • Pub/Sub: 10 MB.

Change partitioning after setup:
  • Kafka: Yes (increase only; does not re-partition existing data).
  • Kinesis: Yes, by "resharding" (merging or splitting shards).
  • Event Hubs: No.
  • Pub/Sub: Not under user control.

Partition/shard limit:
  • Kafka: No limit; the optimal partition count depends on your use case.
  • Kinesis: 500 (US/EU) or 200 (other regions), although you can apply to Amazon to increase this.
  • Event Hubs: Between 2 and 32 (you can pay for more).
  • Pub/Sub: Not visible to the user.

Latency:
  • Kafka: Milliseconds for some setups; benchmarking showed ~2 ms median latency.
  • Kinesis: 200 ms to 5 seconds.
  • Event Hubs: No quoted figures.
  • Pub/Sub: No quoted figures.

Replication:
  • Kafka: Configurable replicas; acknowledgement of a published message can be on send, on receipt, or on successful (local-only) replication.
  • Kinesis: Hidden (across three zones); a published message is always acknowledged after replication.
  • Event Hubs: Configurable (and allowed across regions for the standard tier).
  • Pub/Sub: Hidden; a published message is acknowledged after half the disks on half the clusters have it.

Push model supported:
  • Kafka: Pseudo, via Apache Spark, or consumers can request data using blocking long polls, so kind of.
  • Kinesis: Yes (via the Kinesis Client Library).
  • Event Hubs: Yes (via AMQP 1.0).
  • Pub/Sub: Yes (push subscriptions deliver over HTTPS).

Pull model supported:
  • Kafka: Yes.
  • Kinesis: Yes.
  • Event Hubs: Yes.
  • Pub/Sub: Yes.

I hope you find this useful. Thanks for reading!!

Credits: Stephane Maarek, Scott Logic, and Wikipedia.
