It's all about Apache Kafka

You might have heard about Apache Kafka. Let's dig into it and explore why you should know about it and what it brings to the table.

Let's start with the ABCs of it; take a sip of coffee and let's roll.

In a typical application, we have a source system and a target system, with data to be transferred between them. So we build an integration between the two, and all is well.

But we may end up with multiple sources and multiple targets, and then the problem arises: with, say, 2 sources and 4 targets, we have to maintain 2 × 4 = 8 point-to-point integrations. For each of those integrations you have to worry about the protocol, the data schema, the data format, and so on, and every additional target adds load on the source systems.

What we need to solve this is a distributed messaging system, and that's where Apache Kafka comes in: all sources and targets are decoupled from each other.

Apache Kafka was developed at LinkedIn and later donated to the Apache Software Foundation. It's written in Scala and Java, and it's designed to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.

Kafka is so powerful that LinkedIn itself, along with Netflix, Uber, Airbnb, Cisco, Walmart, and many more companies (2,000+ in total, including 35% of the Fortune 500), uses it extensively.

Scaling:
  • Can scale out to hundreds of brokers,
  • Scales horizontally,
  • Handles millions of messages per second

Performance: latency under 10 ms, which is close to real time.

Use Cases:
  • Messaging Systems,
  • Activity tracking,
  • Metrics,
  • Stream processing,
  • Integration with other systems like Spark, Flink, Storm, Hadoop, etc.

Real-time examples:
  • Netflix uses Kafka to serve recommendations in real time while you're in the middle of watching something,
  • Uber uses Kafka to gather user, taxi, and trip data and to allocate drivers in real time,
  • LinkedIn uses Kafka to power connection recommendations and spam filtering in real time.

Now, what are the other options, and what difference does Kafka make? 

Kafka's APIs allow producers to publish data streams to topics. A topic is a partitioned log of records, with each partition being ordered and immutable. Consumers can subscribe to topics. Kafka runs on a cluster of brokers, with partitions split across the cluster's nodes; as a result, Kafka aims to be highly scalable. However, Kafka requires extra effort from the user to configure and scale it according to their requirements.
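To make this concrete, here's a minimal sketch of a producer and a consumer using Kafka's official Java client. The broker address (localhost:9092), the topic name (user-activity), and the consumer group name are illustrative assumptions rather than anything Kafka prescribes.

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KafkaSketch {
        public static void main(String[] args) {
            // Producer: publishes a record to a topic. With the default partitioner,
            // the record key determines which partition the record lands on.
            Properties producerProps = new Properties();
            producerProps.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            producerProps.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            producerProps.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
                producer.send(new ProducerRecord<>("user-activity", "user-42", "clicked-checkout"));
            }

            // Consumer: joins a consumer group, subscribes to the topic, and polls for records.
            // A real consumer would poll in a loop.
            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", "localhost:9092");
            consumerProps.put("group.id", "activity-trackers"); // assumed group name
            consumerProps.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            consumerProps.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
                consumer.subscribe(List.of("user-activity"));
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }

Because records with the same key hash to the same partition, ordering is only guaranteed within a partition, which is exactly what the comparison further below states.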

Amazon Kinesis is a cloud-based real-time processing service. Kinesis producers can push data to the stream as soon as it is created. Kinesis breaks the stream across shards (similar to partitions), determined by your partition key. Each shard has a hard limit on the number of transactions and data volume per second; if you exceed this limit, you need to increase your number of shards. Much of the maintenance and configuration is hidden from the user, and AWS makes scaling easy, with users only paying for what they use.
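For comparison, here's a rough sketch of pushing a record into a Kinesis stream with the AWS SDK for Java v2; the stream name, partition key, and region are placeholder assumptions.

    import software.amazon.awssdk.core.SdkBytes;
    import software.amazon.awssdk.regions.Region;
    import software.amazon.awssdk.services.kinesis.KinesisClient;
    import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;
    import software.amazon.awssdk.services.kinesis.model.PutRecordResponse;

    public class KinesisSketch {
        public static void main(String[] args) {
            try (KinesisClient kinesis = KinesisClient.builder().region(Region.US_EAST_1).build()) {
                // The partition key determines which shard the record is written to.
                PutRecordRequest request = PutRecordRequest.builder()
                        .streamName("user-activity")                       // assumed stream name
                        .partitionKey("user-42")                           // assumed partition key
                        .data(SdkBytes.fromUtf8String("clicked-checkout"))
                        .build();
                PutRecordResponse response = kinesis.putRecord(request);
                System.out.println("Written to shard " + response.shardId()
                        + " at sequence " + response.sequenceNumber());
            }
        }
    }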

Microsoft Azure Event Hubs describes itself as an event ingestor capable of receiving and processing millions of events per second. Producers send events to an event hub via AMQP or HTTPS. Event Hubs also has the concept of partitions to enable specific consumers to receive a subset of the stream, and consumers connect via AMQP. Consumer groups are used to allow consuming applications to have a separate view of the event stream. Event Hubs is a fully managed service, but users must pre-purchase capacity in terms of throughput units.
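A similarly hedged sketch of sending a single event with the azure-messaging-eventhubs Java client might look like this; the connection string and event hub name are placeholders for your own namespace details.

    import java.util.Collections;

    import com.azure.messaging.eventhubs.EventData;
    import com.azure.messaging.eventhubs.EventHubClientBuilder;
    import com.azure.messaging.eventhubs.EventHubProducerClient;

    public class EventHubsSketch {
        public static void main(String[] args) {
            String connectionString = "<your-event-hubs-namespace-connection-string>"; // placeholder
            String eventHubName = "user-activity";                                     // placeholder

            EventHubProducerClient producer = new EventHubClientBuilder()
                    .connectionString(connectionString, eventHubName)
                    .buildProducerClient();

            // Sends a single event over AMQP; the service assigns it to a partition.
            producer.send(Collections.singletonList(new EventData("clicked-checkout")));
            producer.close();
        }
    }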

Google Pub/Sub offers scalable cloud-based messaging. Publisher applications send messages to a topic, with consumers subscribing to that topic. Messages are persisted in a message store until they are acknowledged. Publishers and pull subscribers are applications that can make Google API HTTPS requests. Scaling is automatic, achieved by distributing the load across data centres, and users are charged by data volume.
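And a rough publish sketch with the google-cloud-pubsub Java client could look like the following; the project ID and topic name are assumptions for illustration.

    import com.google.api.core.ApiFuture;
    import com.google.cloud.pubsub.v1.Publisher;
    import com.google.protobuf.ByteString;
    import com.google.pubsub.v1.PubsubMessage;
    import com.google.pubsub.v1.TopicName;

    public class PubSubSketch {
        public static void main(String[] args) throws Exception {
            TopicName topic = TopicName.of("my-project", "user-activity"); // assumed project/topic
            Publisher publisher = Publisher.newBuilder(topic).build();
            try {
                PubsubMessage message = PubsubMessage.newBuilder()
                        .setData(ByteString.copyFromUtf8("clicked-checkout"))
                        .build();
                // publish() is asynchronous; the future resolves to the server-assigned message ID.
                ApiFuture<String> messageIdFuture = publisher.publish(message);
                System.out.println("Published message " + messageIdFuture.get());
            } finally {
                publisher.shutdown();
            }
        }
    }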

The following comparison covers some of the differences you can consider when choosing between them.


Messaging guarantees:
  • Kafka: At least once with the normal connectors; exactly once with the Spark direct connector.
  • Amazon Kinesis: At least once, unless you build deduplication or idempotency into the consumers.
  • Microsoft Event Hubs: At least once, but allows consumer-managed checkpoints for exactly-once reads.
  • Google Pub/Sub: At least once.

Ordering guarantees:
  • Kafka: Guaranteed within a partition.
  • Amazon Kinesis: Guaranteed within a shard.
  • Microsoft Event Hubs: Guaranteed within a partition.
  • Google Pub/Sub: No ordering guarantees.

Throughput:
  • Kafka: No quoted throughput figures; benchmarking showed a throughput of ~30,000 messages/sec.
  • Amazon Kinesis: One shard can support 1 MB/s input, 2 MB/s output, or 1,000 records per second; benchmarking showed a throughput of ~20,000 messages/sec.
  • Microsoft Event Hubs: Scaled in throughput units, each supporting 1 MB/s ingress, 2 MB/s egress, or 84 GB of storage; the standard tier allows 20 throughput units.
  • Google Pub/Sub: Default is 100 MB/s in and 200 MB/s out, but the maximum is quoted as unlimited.

Configurable persistence period:
  • Kafka: No maximum.
  • Amazon Kinesis: 1 to 7 days (default is 24 hours).
  • Microsoft Event Hubs: 1 to 7 days (default is 24 hours).
  • Google Pub/Sub: 7 days (not configurable) or until acknowledged by all subscribers.

Partitioning:
  • Kafka: Yes.
  • Amazon Kinesis: Yes (shards).
  • Microsoft Event Hubs: Yes.
  • Google Pub/Sub: Yes, but not under user control.

Consumer groups:
  • Kafka: Yes.
  • Amazon Kinesis: Yes (called auto-scaling groups).
  • Microsoft Event Hubs: Yes (up to 20 for the standard pricing tier).
  • Google Pub/Sub: Yes (called subscriptions).

Disaster recovery (with cross-region replication):
  • Kafka: Yes (cluster mirroring).
  • Amazon Kinesis: Automatic, across 3 zones.
  • Microsoft Event Hubs: Yes (for the standard tier).
  • Google Pub/Sub: Yes (automatic).

Maximum size of each data blob:
  • Kafka: Default 1 MB (but can be configured).
  • Amazon Kinesis: 1 MB.
  • Microsoft Event Hubs: Default 256 KB (can pay for up to 1 MB).
  • Google Pub/Sub: 10 MB.

Change partitioning after setup:
  • Kafka: Yes (increase only; does not re-partition existing data).
  • Amazon Kinesis: Yes, by "resharding" (merging or splitting shards).
  • Microsoft Event Hubs: No.
  • Google Pub/Sub: No (not under user control).

Partition/shard limit:
  • Kafka: No limit; the optimal number of partitions depends on your use case.
  • Amazon Kinesis: 500 (US/EU) or 200 (other regions), although you can apply to Amazon to increase this.
  • Microsoft Event Hubs: Between 2 and 32 (can pay for more).
  • Google Pub/Sub: Not visible to the user.

Latency:
  • Kafka: Milliseconds for some set-ups; benchmarking showed ~2 ms median latency.
  • Amazon Kinesis: 200 ms to 5 seconds.
  • Microsoft Event Hubs: No quoted figures.
  • Google Pub/Sub: No quoted figures.

Replication:
  • Kafka: Configurable replicas; acknowledgement of a published message can be on send, on receipt, or on successful (local-only) replication.
  • Amazon Kinesis: Hidden (across three zones); a published message is always acknowledged after replication.
  • Microsoft Event Hubs: Configurable (and allowed across regions for the standard tier).
  • Google Pub/Sub: Hidden; a published message is acknowledged after half the disks on half the clusters have the message.

Push model supported:
  • Kafka: Sort of; pseudo-push with Apache Spark, or consumers can request data using blocking long polls.
  • Amazon Kinesis: Yes (via the Kinesis Client Library).
  • Microsoft Event Hubs: Yes (via AMQP 1.0).
  • Google Pub/Sub: Yes.

Pull model supported:
  • Kafka: Yes.
  • Amazon Kinesis: Yes.
  • Microsoft Event Hubs: No.
  • Google Pub/Sub: Yes.
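
To give a feel for the extra configuration effort mentioned in the Kafka entries above, here is a rough sketch using Kafka's Java AdminClient to create a topic with explicit partition and replication settings and later grow its partition count; the broker address, topic name, and counts are illustrative assumptions.

    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewPartitions;
    import org.apache.kafka.clients.admin.NewTopic;

    public class TopicAdminSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // Create a topic with 3 partitions, each replicated to 2 brokers.
                NewTopic topic = new NewTopic("user-activity", 3, (short) 2);
                admin.createTopics(Collections.singleton(topic)).all().get();

                // Later, increase the partition count to 6. As noted above, this is
                // increase-only and does not re-partition existing data.
                admin.createPartitions(Map.of("user-activity", NewPartitions.increaseTo(6)))
                        .all().get();
            }
        }
    }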

I hope you find this useful. Thanks for reading!!

Credits: Stephane Maarek, Scottlogic & Wiki.
