It's all about Apache Kafka
You might have heard about Apache Kafka. Let's dig into it and explore why you should know about it and what it brings to the table.
Let's start with the ABCs of it: grab a sip of coffee and let's roll.
But when we have multiple sources and multiple targets, a problem arises: as image 2 shows, with 2 sources and 4 targets we have to maintain 2 × 4 = 8 integrations. For each one, you have to care about the protocol, data schema, data format, etc. And with point-to-point integrations, every additional target adds load on the source.
What we need is a distributed messaging system to solve this, and that is where Apache Kafka comes in: all sources and targets are decoupled through it.
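To make the decoupling idea concrete, here is a deliberately minimal in-memory sketch (not real Kafka; all names are made up for illustration). Sources publish to a topic on the broker, targets subscribe to that topic, and neither side knows about the other:

```python
from collections import defaultdict

class ToyBroker:
    """A minimal in-memory stand-in for a message broker."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        # the source only knows the topic, never the targets
        for handler in self.subscribers[topic]:
            handler(message)

broker = ToyBroker()
received = []
# two independent targets, neither coupled to the source
broker.subscribe("orders", lambda m: received.append(("billing", m)))
broker.subscribe("orders", lambda m: received.append(("shipping", m)))
broker.publish("orders", {"id": 1})
# with N sources and M targets we now maintain N + M connections
# to the broker instead of N * M point-to-point integrations
```

The key point is the arithmetic in the last comment: adding a fifth target means one new subscription, not one new integration per source.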
Apache Kafka was developed at LinkedIn and later donated to the Apache Software Foundation. It is written in Scala and Java, and it is designed to provide a unified, low-latency, high-throughput platform for handling real-time data feeds.
Kafka is so powerful that LinkedIn itself, along with Netflix, Uber, Airbnb, Cisco, Walmart, and many more (2,000+ companies, including 35% of the Fortune 500), uses it extensively.
Scaling:
- It can scale to hundreds of brokers,
- Horizontal scaling,
- Millions of messages per second
Performance: latency under 10 ms, which is near real-time.
Use Cases:
- Messaging Systems,
- Activity tracking,
- Metrics,
- Stream processing,
- Integration with other systems like Spark, Flink, Storm, Hadoop, etc.
Real-time examples:
- Netflix uses Kafka to apply recommendations in real time while you're in the middle of watching something,
- Uber uses Kafka to gather user, taxi, and trip data and to allocate drivers in real time,
- LinkedIn uses Kafka to power recommendations and spam filtering in real time.
Now, what are the other options, and what difference does Kafka make?
Kafka APIs allow producers to publish data streams to topics. A topic is a partitioned log of records, with each partition being ordered and immutable. Consumers subscribe to topics. Kafka runs on a cluster of brokers, with partitions split across cluster nodes, which is what makes it highly scalable. However, Kafka requires extra effort from the user to configure and scale according to requirements.
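The storage model above (a topic as a set of ordered, immutable, append-only partitions, with consumers reading by offset) can be sketched in a few lines. This is a toy illustration, not Kafka's actual implementation; `hash()` stands in for Kafka's key-hashing partitioner:

```python
class ToyTopic:
    """A sketch of Kafka's model: a topic is a set of partitions,
    each an append-only (ordered, immutable) log of records."""
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # Kafka's default partitioner hashes the record key, so all
        # records with the same key land in the same partition
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)       # append only, never rewritten
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def consume(self, partition, offset):
        # consumers read sequentially by offset within one partition
        return self.partitions[partition][offset:]

topic = ToyTopic(num_partitions=3)
p1, o1 = topic.produce("user-42", "clicked")
p2, o2 = topic.produce("user-42", "purchased")
assert p1 == p2  # same key -> same partition, so ordering is preserved
```

This also shows why ordering is guaranteed only *within* a partition (see the table below): records with different keys may land in different partitions and be consumed in any interleaving.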
Amazon Kinesis is a cloud-based real-time processing service. Kinesis producers can push data to the stream as soon as it is created. Kinesis breaks the stream across shards (similar to partitions), determined by your partition key. Each shard has a hard limit on the number of transactions and the data volume per second; if you exceed it, you need to increase your number of shards. Much of the maintenance and configuration is hidden from the user, and AWS makes scaling easy, with users paying only for what they use.
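Those per-shard hard limits make shard sizing simple back-of-the-envelope arithmetic. The helper below is purely illustrative (it is not an AWS API), using the commonly quoted per-shard limits of 1 MB/s in, 1,000 records/s in, and 2 MB/s out:

```python
import math

def required_shards(in_mb_per_s, records_per_s, out_mb_per_s):
    """Back-of-the-envelope shard count from per-shard hard limits:
    1 MB/s ingress, 1,000 records/s ingress, 2 MB/s egress."""
    return max(
        math.ceil(in_mb_per_s / 1.0),
        math.ceil(records_per_s / 1000.0),
        math.ceil(out_mb_per_s / 2.0),
    )

# e.g. 5 MB/s in, 12,000 records/s, 16 MB/s out: limited by record rate
required_shards(5, 12_000, 16)  # 12 shards
```

Whichever dimension you hit first (bytes in, records in, or bytes out) decides the shard count, which is why bursty small-record workloads often need more shards than raw volume suggests.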
Microsoft Azure Event Hubs describes itself as an event ingestor capable of receiving and processing millions of events per second. Producers send events to an event hub via AMQP or HTTPS. Event Hubs also has the concept of partitions, enabling specific consumers to receive a subset of the stream. Consumers connect via AMQP, and consumer groups allow consuming applications to each have a separate view of the event stream. Event Hubs is a fully managed service, but users must pre-purchase capacity in terms of throughput units.
Google Cloud Pub/Sub offers scalable cloud-based messaging. Publisher applications send messages to a topic, with consumers subscribing to that topic. Messages are persisted in a message store until they are acknowledged. Publishers and pull subscribers are applications that can make Google API HTTPS requests. Scaling is automatic, by distributing load across data centres, and users are charged by data volume.
The following table illustrates some of the differences you can consider while choosing between them.
| | Kafka | Amazon Kinesis | Microsoft Event Hub | Google Pub/Sub |
|---|---|---|---|---|
| Messaging guarantees | At least once per normal connector; precisely once with the Spark direct connector. | At least once unless you build deduping or idempotency into the consumers. | At least once, but allows consumer-managed checkpoints for exactly-once reads. | At least once. |
| Ordering guarantees | Guaranteed within a partition. | Guaranteed within a shard. | Guaranteed within a partition. | No ordering guarantees. |
| Throughput | No quoted throughput figures; the study showed a throughput of ~30,000 messages/sec. | One shard supports 1 MB/s input, 2 MB/s output, or 1,000 records per second; the study showed a throughput of ~20,000 messages/sec. | Scaled in throughput units; each supports 1 MB/s ingress, 2 MB/s egress, or 84 GB of storage. The standard tier allows 20 throughput units. | Default is 100 MB/s in, 200 MB/s out, but the maximum is quoted as unlimited. |
| Configurable persistence period | No maximum. | 1 to 7 days (default is 24 hours). | 1 to 7 days (default is 24 hours). | 7 days (not configurable) or until acknowledged by all subscribers. |
| Partitioning | Yes. | Yes (shards). | Yes. | Yes, but not under user control. |
| Consumer groups | Yes. | Yes (called auto-scaling groups). | Yes (up to 20 for the standard pricing tier). | Yes (called subscriptions). |
| Disaster recovery (cross-region replication) | Yes (cluster mirroring). | Automatic, across 3 zones. | Yes (for the standard tier). | Yes (automatic). |
| Maximum size of each data blob | Default 1 MB (but can be configured). | 1 MB. | Default 256 KB (paid for up to 1 MB). | 10 MB. |
| Change partitioning after setup | Yes (increase only; does not re-partition existing data). | Yes, by "resharding" (merge or split shards). | No. | No (not under user control). |
| Partition/shard limit | No limit; the optimal partition count depends on your use case. | 500 (US/EU) or 200 (other regions), although you can apply to Amazon to increase this. | Between 28 and 32 (can pay for more). | Not visible to the user. |
| Latency | Milliseconds for some setups; benchmarking showed ~2 ms median latency. | 200 ms to 5 seconds. | No quoted figures. | No quoted figures. |
| Replication | Configurable replicas; acknowledgement of a published message can be on send, on receipt, or on successful replication (local only). | Hidden (across three zones); a published message is always acknowledged after replication. | Configurable (and allowed across regions for the standard tier). | Hidden; a published message is acknowledged after half the disks on half the clusters have the message. |
| Push model supported | Pseudo, with Apache Spark, or by requesting data using blocking long polls (so, kind of). | Yes (via the Kinesis Client Library). | Yes (via AMQP 1.0). | Yes. |
| Pull model supported | Yes. | Yes. | No. | Yes. |
I hope you find this useful. Thanks for reading!!
Credits: Stephane Maarek, Scott Logic & Wikipedia.