
It's all about Apache Kafka

You might have heard about Apache Kafka. Let's dig into it and explore why you should be aware of it and what it brings to the table.

Let's start with the 'ABC' of it; grab a sip of coffee and let's roll.

In a typical application, we have a source system and a target system, with data to be transferred between them. So we build an integration between the two; all well!

But when there are multiple sources and multiple targets, the problem arises: with 2 sources and 4 targets, we have to maintain 2 × 4 = 8 point-to-point integrations. For each of these integrations you have to care about the protocol, data schema, data format, etc., and every additional target puts more load on the source.

What we need is a distributed messaging system to solve this, and that's where Apache Kafka comes in: all sources and targets are decoupled from each other.

Apache Kafka was developed at LinkedIn, which later donated it to the Apache Software Foundation. It is written in Scala and Java, and is designed to provide a unified, low-latency, high-throughput platform for handling real-time data feeds.

Kafka is so powerful that LinkedIn itself, along with Netflix, Airbnb, Uber, Cisco, Walmart, and many more (2,000+ companies, including 35% of the Fortune 500), uses it extensively.

Scaling:
  • It can be scaled up to hundreds of brokers,
  • Horizontal scaling,
  • Millions of messages per second

Performance: latency under 10 ms, i.e., almost real-time.

Use Cases:
  • Messaging Systems,
  • Activity tracking,
  • Metrics,
  • Stream processing,
  • Integration with other systems like Spark, Flink, Storm, Hadoop, etc.
Real-time examples:
  • Netflix uses Kafka to apply recommendations in real time while you're in the middle of watching something,
  • Uber uses Kafka to gather user, taxi, and trip data and to allocate drivers in real time,
  • LinkedIn uses Kafka to power recommendations and spam filtering in real time.
Now, what are the other options, and what difference does Kafka make?

Kafka APIs allow producers to publish streams of data to topics. A topic is a partitioned log of records, with each partition being ordered and immutable. Consumers subscribe to topics. Kafka runs on a cluster of brokers, with partitions split across the cluster nodes; as a result, Kafka aims to be highly scalable. However, Kafka requires extra effort from the user to configure and scale it according to requirements.
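To make the terminology concrete, here is a minimal producer sketch using the official Kafka Java client. The broker address, topic name ('ride-events'), and key are made-up placeholders for illustration:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class RideEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Records with the same key always land in the same partition,
        // which is what the per-partition ordering guarantee builds on.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("ride-events", "user-42", "trip-started"));
        } // close() flushes any buffered records before returning
    }
}
```

A consumer would then subscribe to 'ride-events', typically as part of a consumer group, and read each partition in order.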

Amazon Kinesis is a cloud-based real-time processing service. Kinesis producers can push data to a stream as soon as it is created. Kinesis breaks the stream across shards (similar to partitions), determined by your partition key. Each shard has a hard limit on the number of transactions and data volume per second; if you exceed this limit, you need to increase your number of shards. Much of the maintenance and configuration is hidden from the user, and AWS allows easy scaling, with users only paying for what they use.
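As a rough sketch of the same "publish a record" step with the AWS SDK for Java v2; the stream name and partition key are placeholders, and credentials/region are assumed to come from the default provider chain:

```java
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;

public class KinesisPublisher {
    public static void main(String[] args) {
        try (KinesisClient kinesis = KinesisClient.create()) {
            // The partition key decides which shard receives the record,
            // mirroring Kafka's key -> partition mapping.
            PutRecordRequest request = PutRecordRequest.builder()
                    .streamName("ride-events")          // assumed stream name
                    .partitionKey("user-42")
                    .data(SdkBytes.fromUtf8String("trip-started"))
                    .build();
            kinesis.putRecord(request);
        }
    }
}
```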

Microsoft Azure Event Hubs describes itself as an event ingestor capable of receiving and processing millions of events per second. Producers send events to an event hub via AMQP or HTTPS. Event Hubs also has the concept of partitions to enable specific consumers to receive a subset of the stream, and consumers connect via AMQP. Consumer groups are used to allow consuming applications to have a separate view of the event stream. Event Hubs is a fully managed service, but users must pre-purchase capacity in terms of throughput units.
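A minimal send sketch with the 'azure-messaging-eventhubs' Java client might look like the following; the connection string and event hub name are placeholders:

```java
import com.azure.messaging.eventhubs.EventData;
import com.azure.messaging.eventhubs.EventDataBatch;
import com.azure.messaging.eventhubs.EventHubClientBuilder;
import com.azure.messaging.eventhubs.EventHubProducerClient;

public class EventHubPublisher {
    public static void main(String[] args) {
        // Placeholder connection string for an assumed namespace and hub.
        String connectionString = "<your-namespace-connection-string>";
        EventHubProducerClient producer = new EventHubClientBuilder()
                .connectionString(connectionString, "ride-events")
                .buildProducerClient();

        // Events are grouped into a batch and sent over AMQP in one call;
        // tryAdd returns false once the batch is full.
        EventDataBatch batch = producer.createBatch();
        batch.tryAdd(new EventData("trip-started"));
        producer.send(batch);
        producer.close();
    }
}
```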

Google Pub/Sub offers scalable cloud-based messaging. Publisher applications send messages to a topic, with consumers subscribing to that topic. Messages are persisted in a message store until they are acknowledged. Publishers and pull subscribers are applications that can make Google API HTTPS requests. Scaling is automatic, achieved by distributing the load across data centers, and users are charged by data volume.
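And a comparable publish sketch with the 'google-cloud-pubsub' Java client; the project and topic IDs are placeholders, and credentials are assumed to come from the environment:

```java
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

public class PubSubPublisher {
    public static void main(String[] args) throws Exception {
        // Placeholder project and topic IDs.
        TopicName topic = TopicName.of("my-project", "ride-events");
        Publisher publisher = Publisher.newBuilder(topic).build();
        try {
            PubsubMessage message = PubsubMessage.newBuilder()
                    .setData(ByteString.copyFromUtf8("trip-started"))
                    .build();
            // publish() returns a future that resolves to the
            // server-assigned message ID once the message is stored.
            publisher.publish(message).get();
        } finally {
            publisher.shutdown();
        }
    }
}
```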

The following comparison covers some of the differences you can consider while choosing one of them.


Messaging guarantees:
  • Kafka: At least once with a normal connector; exactly once with the Spark direct connector.
  • Amazon Kinesis: At least once, unless you build deduplication or idempotency into the consumers.
  • Microsoft Event Hubs: At least once, but allows consumer-managed checkpoints for exactly-once reads.
  • Google Pub/Sub: At least once.

Ordering guarantees:
  • Kafka: Guaranteed within a partition.
  • Amazon Kinesis: Guaranteed within a shard.
  • Microsoft Event Hubs: Guaranteed within a partition.
  • Google Pub/Sub: No ordering guarantees.

Throughput:
  • Kafka: No quoted throughput figures; one study showed a throughput of ~30,000 messages/sec.
  • Amazon Kinesis: One shard can support 1 MB/s input, 2 MB/s output, or 1,000 records per second; one study showed ~20,000 messages/sec.
  • Microsoft Event Hubs: Scaled in throughput units, each supporting 1 MB/s ingress, 2 MB/s egress, or 84 GB of storage; the standard tier allows 20 throughput units.
  • Google Pub/Sub: Default is 100 MB/s in and 200 MB/s out, but the maximum is quoted as unlimited.

Configurable persistence period:
  • Kafka: No maximum.
  • Amazon Kinesis: 1 to 7 days (default is 24 hours).
  • Microsoft Event Hubs: 1 to 7 days (default is 24 hours).
  • Google Pub/Sub: 7 days (not configurable), or until acknowledged by all subscribers.

Partitioning:
  • Kafka: Yes.
  • Amazon Kinesis: Yes (shards).
  • Microsoft Event Hubs: Yes.
  • Google Pub/Sub: Yes, but not under user control.

Consumer groups:
  • Kafka: Yes.
  • Amazon Kinesis: Yes (called auto-scaling groups).
  • Microsoft Event Hubs: Yes (up to 20 for the standard pricing tier).
  • Google Pub/Sub: Yes (called subscriptions).

Disaster recovery (with cross-region replication):
  • Kafka: Yes (cluster mirroring).
  • Amazon Kinesis: Automatic, across 3 zones.
  • Microsoft Event Hubs: Yes (for the standard tier).
  • Google Pub/Sub: Yes (automatic).

Maximum size of each data blob:
  • Kafka: Default 1 MB (but can be configured).
  • Amazon Kinesis: 1 MB.
  • Microsoft Event Hubs: Default 256 KB (paid, up to 1 MB).
  • Google Pub/Sub: 10 MB.

Change partitioning after setup:
  • Kafka: Yes (increase only; does not re-partition existing data).
  • Amazon Kinesis: Yes, by "resharding" (merging or splitting shards).
  • Microsoft Event Hubs: No.
  • Google Pub/Sub: No (not under user control).

Partition/shard limit:
  • Kafka: No limit; the optimal number of partitions depends on your use case.
  • Amazon Kinesis: 500 (US/EU) or 200 (other regions), although you can apply to Amazon to increase this.
  • Microsoft Event Hubs: Between 28 and 32 (can pay for more).
  • Google Pub/Sub: Not visible to the user.

Latency:
  • Kafka: Milliseconds for some set-ups; benchmarking showed ~2 ms median latency.
  • Amazon Kinesis: 200 ms to 5 seconds.
  • Microsoft Event Hubs: No quoted figures.
  • Google Pub/Sub: No quoted figures.

Replication:
  • Kafka: Configurable replicas; acknowledgment of a published message can be on send, on receipt, or on successful replication (local only).
  • Amazon Kinesis: Hidden (across three zones); a published message is always acknowledged after replication.
  • Microsoft Event Hubs: Configurable (and allowed across regions for the standard tier).
  • Google Pub/Sub: Hidden; a published message is acknowledged after half the disks on half the clusters have the message.

Push model supported:
  • Kafka: Pseudo-push via Apache Spark, or consumers can request data using blocking long polls; so, kind of.
  • Amazon Kinesis: Yes (via the Kinesis Client Library).
  • Microsoft Event Hubs: Yes (via AMQP 1.0).
  • Google Pub/Sub: Yes.

Pull model supported:
  • Kafka: Yes.
  • Amazon Kinesis: Yes.
  • Microsoft Event Hubs: No.
  • Google Pub/Sub: Yes.

I hope you find this useful. Thanks for reading!!

Credits: Stephane Maarek, Scottlogic & Wiki.
