Do you need millisecond-level event processing, or is a near real-time pipeline sufficient? Should you prioritize scalability and throughput, or do you require deep integration with a data lakehouse? And most importantly, which engine will scale efficiently as your data volume grows?
Let me break it to you in just the second paragraph: it all depends on your use case. That said, choosing an engine that can efficiently handle growing data volumes with the right balance of relevant features is critical.
The goal of this blog is to unpack the definition of streaming, address common misconceptions, explore the internals of stream processing engines, and compare the trade-offs in their designs and how they impact streaming data pipelines. More importantly, we’ll highlight why relying on a single engine for all your needs may not be enough—as workloads evolve, leveraging multiple engines can provide the flexibility needed to optimize for both performance and cost. Maybe, just maybe, streaming is just a faster batch pipeline. Let’s find out!
If you are an engineer already well-versed in stream processing, feel free to scroll down right to the comparison tables.
A streaming engine is a specialized data processing system designed to handle continuous streams of data in real-time or near real-time. These engines power critical applications where low latency, high throughput, and scalability are key, such as fraud detection in financial transactions and real-time recommendations in e-commerce.
Data processing architectures have evolved significantly over time:
When discussing streaming engines, you’ll often hear the terms “real-time,” “near real-time,” or “micro-batch” - each defining a different level of latency and processing approach. At Onehouse, we deliver a related, but different, incremental processing model.
Before diving further into stream processing engines, let's step back and look at the broader data landscape.
Open source engines such as Apache Flink™, Apache Beam™, Apache Samza™, Spark Structured Streaming, and Apache Storm™ enable organizations to build custom real-time processing solutions. Managed offerings such as Ververica (Flink), Confluent (Kafka and Kafka Streams), StreamNative (Pulsar), and Databricks (Spark) provide fully managed solutions, making it easier to scale streaming workloads without operational overhead.
Streaming engines can be broadly categorized based on their processing models:
Real-time and near real-time engines process each event as it arrives with minimal delay, typically in milliseconds to seconds. Example use cases include fraud detection in banking transactions.
In this blog, we cover two such engines in depth, namely Apache Flink and Kafka Streams. We chose these engines because of their popularity in the data processing ecosystem.
Apache Flink, initially released in May 2011, is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments and perform computations at in-memory speed at scale. Flink provides a high-throughput, low-latency streaming engine and support for event-time processing and state management. It was developed for streaming-first use cases and later added a unified programming interface for both stream and batch processing.
Kafka Streams (added in Kafka’s 0.10.0.0 release) is a lightweight stream-processing library for building applications and microservices, where input and output data are stored in Kafka clusters. It combines the simplicity of writing and deploying standard Java and Scala applications on the client side with the benefits of Kafka’s server-side cluster technology. Kafka Streams has a low barrier to entry as it lets you quickly write and run a small-scale application on a single machine and only requires running additional instances of the application on multiple machines to scale up to high-volume production workloads. The key design choice in Kafka Streams is that the input and output data is always a Kafka topic.
Micro-batch engines lie between streaming and batch processing engines, processing small, frequent batches that approximate streaming behavior at regular intervals. In this blog, we will cover Spark’s Structured Streaming in depth and also see how it compares with the real-time streaming engines.
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. It lets users express their streaming computation the same way they would express a batch computation on static data. The Spark SQL engine takes care of running it incrementally and continuously, updating the final result as streaming data continues to arrive. Users can also use the Dataset/DataFrame API in Scala, Java, Python, or R to express streaming aggregations, event-time windows, stream-to-batch joins, and more.
Knowing what these tools are is just the beginning of the journey. In this section, we will dig into the core concepts of stream processing at greater depth, given that this is still a growing and relatively new technology category.
You’ll learn:
At its core, stream processing isn’t just about reacting to individual events; it’s about understanding those events in context. And context requires state.
Stateful processing allows a streaming engine to remember information across events. Without it, each incoming data point is treated in isolation, making it impossible to perform essential operations such as:
In other words, state is what turns raw events into meaningful insights.
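To make this concrete, here is a minimal, engine-agnostic sketch (not tied to any of the engines above) contrasting a stateless handler, which sees each event in isolation, with a stateful one that keeps a running count per key:

```python
# Illustrative sketch: state is what lets a processor interpret an event
# in the context of the events that came before it.
from collections import defaultdict

def stateless_handler(event):
    # No memory: every event looks like the first one.
    return f"user={event['user']} clicked"

class StatefulCounter:
    def __init__(self):
        self.counts = defaultdict(int)  # the "state": clicks seen per user

    def handle(self, event):
        self.counts[event["user"]] += 1
        return self.counts[event["user"]]

counter = StatefulCounter()
events = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
print([counter.handle(e) for e in events])  # [1, 1, 2]
```

The third event produces `2`, not `1`: only the stateful version can tell that this is a repeat visitor, which is exactly the kind of context that aggregations, joins, and deduplication depend on.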
Stateful processing is critical for remembering information across events. But to remember, you need the state to be stored somewhere (even if it’s temporary). And that’s where the state store comes in. Think of a state store as the internal memory of a stream processing job.
It holds intermediate values like:
In systems such as Kafka Streams, each application instance maintains a local state store—usually backed by RocksDB or in-memory hash maps. Structured Streaming supports stateful processing through the HDFS state store provider or RocksDB. Flink uses the term State Backend to refer to its infrastructure implementation that manages how state is stored, accessed, and recovered during stream processing. It provides two forms of state backends, HashMapStateBackend and EmbeddedRocksDBStateBackend.
For production workloads where durability, recoverability from failure, low latency, and scalability are key, choose a persistent state store (or backend) such as RocksDB to unlock durability at scale.
Here’s a simple representation of how you can use Flink to query the state:
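As a stand-in for the Flink-specific setup, the pattern can be sketched in a few lines of plain Python. Flink’s real mechanism is its queryable state API (a `QueryableStateClient` performing network lookups against the job’s state backend); here both sides live in one process purely for illustration, and all names are hypothetical:

```python
# Simplified model of the queryable-state pattern: the running job keeps
# keyed state, and an external "client" reads it without stopping the job.
class RunningJob:
    def __init__(self):
        self._keyed_state = {}  # key -> aggregate value

    def process(self, key, amount):
        self._keyed_state[key] = self._keyed_state.get(key, 0) + amount

    def query_state(self, key):
        # Stand-in for a network lookup against the job's state backend.
        return self._keyed_state.get(key)

job = RunningJob()
for key, amount in [("acct-1", 50), ("acct-2", 20), ("acct-1", 30)]:
    job.process(key, amount)

print(job.query_state("acct-1"))  # 80 -- live state, no external store needed
```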
While state stores (or backends) manage where and how state is held during processing, checkpointing ensures that state and stream progress are not lost in the event of a failure.
Checkpointing is a mechanism for periodically saving a snapshot of the application’s state and its position in input sources at a point in time. If the application crashes, it can recover by restoring from the last checkpoint, resuming processing without data loss or duplication.
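The essence of checkpoint-and-restore can be sketched as follows (a deliberately simplified, single-process model, not any engine’s actual implementation): snapshot the state together with the input position, and after a crash resume from that snapshot.

```python
# Minimal sketch of checkpointing: snapshot state + input offset together,
# then restore both after a failure so no event is lost or reprocessed.
import copy

events = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

state, position = {}, 0
checkpoint = None

def process(upto):
    global position, checkpoint
    while position < upto:
        key, value = events[position]
        state[key] = state.get(key, 0) + value
        position += 1
    # Snapshot the state atomically with the input offset.
    checkpoint = (copy.deepcopy(state), position)

process(upto=2)           # checkpoint taken at offset 2
state["a"] = -999         # simulate a crash corrupting in-memory state
state, position = copy.deepcopy(checkpoint[0]), checkpoint[1]  # restore
process(upto=4)           # resume: no event lost, none double-counted
print(state)  # {'a': 4, 'b': 6}
```

The key design point is that the snapshot captures state and source position as one consistent unit; restoring only one of the two would lose or duplicate events.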
Here’s how different engines implement checkpointing and state recovery:
The combination of stateful processing, durable state stores, and checkpointing provides robust fault tolerance. It ensures that even if a node or job fails:
Flink and Spark rely on checkpointing + distributed snapshots, while Kafka Streams relies on changelog replication to persist and recover local state.
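The changelog approach can be sketched in a few lines (a simplified model of the Kafka Streams idea, with the changelog as a plain list rather than a compacted Kafka topic): every local state update is also appended to a changelog, and a restarted instance rebuilds its state by replaying that log.

```python
# Sketch of changelog-based recovery: write-through to a changelog on every
# state update, then replay the log to rebuild the store after a restart.
changelog = []          # stand-in for a compacted Kafka changelog topic

def update(store, key, value):
    store[key] = value
    changelog.append((key, value))   # write-through to the changelog

store = {}
update(store, "a", 1)
update(store, "b", 2)
update(store, "a", 5)   # later write for the same key wins

recovered = {}
for key, value in changelog:   # replay on restart
    recovered[key] = value

print(recovered == {"a": 5, "b": 2})  # True
```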
In batch processing, time isn’t a major consideration—you're working with static data. But in stream processing, time is dynamic, and events can arrive late (due to network delays, retries, and the like), out of order, in real time, or after the fact (backfills and corrections).
This is where time semantics come in. Choosing the right time model, i.e., event time or processing time, impacts everything from accuracy to latency to the correctness of results.
For example, imagine you're aggregating user activity into hourly windows. If you use processing time, a late event might fall into the wrong window—or be dropped altogether. With event time, the system can place the event in the correct window, even if it arrives late.
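The hourly-window example above can be sketched directly (using epoch seconds and illustrative timestamps):

```python
# Sketch: assigning an event to an hourly window by processing time vs
# event time. A late event (occurred 10:59, arrived 11:02) lands in the
# wrong window under processing time, but in the correct one under event time.
def hour_window(ts):
    return ts - (ts % 3600)  # start of the hour, in epoch seconds

event = {"event_time": 10 * 3600 + 59 * 60,    # occurred at 10:59
         "arrival_time": 11 * 3600 + 2 * 60}   # arrived at 11:02

by_processing_time = hour_window(event["arrival_time"])
by_event_time = hour_window(event["event_time"])

print(by_processing_time // 3600, by_event_time // 3600)  # 11 10
```

Under processing time, the 10:59 event is counted in the 11:00 window; under event time, it goes where it belongs.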
In stream processing, data doesn’t always flow smoothly. Sometimes, producers generate events faster than the system can process them; this mismatch is known as backpressure. Without proper handling, it leads to memory overflows, increased latency, or, worse, a crashed job.
In other words, backpressure is what keeps your pipeline from cracking under pressure.
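A minimal sketch of the core mechanism (not any engine’s actual implementation) is a bounded buffer between producer and consumer: when the consumer falls behind, the producer blocks instead of letting memory grow without bound.

```python
# Sketch of backpressure with a bounded buffer: queue.put() blocks the
# fast producer whenever the slow consumer lets the buffer fill up.
import queue
import threading
import time

buf = queue.Queue(maxsize=10)   # bounded: the source of backpressure
consumed = []

def producer():
    for i in range(100):
        buf.put(i)              # blocks when the buffer is full
    buf.put(None)               # sentinel: end of stream

def consumer():
    while True:
        item = buf.get()
        if item is None:
            break
        time.sleep(0.001)       # simulate a slow operator
        consumed.append(item)

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads: t.start()
for t in threads: t.join()

print(len(consumed))  # 100 -- nothing dropped, and memory stayed bounded
```

Real engines propagate this signal upstream through the whole pipeline (Flink via bounded network buffers, Kafka Streams by simply polling the broker more slowly), but the bounded-buffer idea is the same.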
Delivery guarantee semantics define the level of assurance provided by a messaging system or delivery protocol. These guarantees are about message order (delivery and processing), delivery reliability, duplication allowance and so on. In other words, delivery semantics determine how exactly a message will be handled in terms of delivery.
There are three types of delivery guarantees:
If you’d like to read more about each one of these guarantees with examples, this blog covers them well. Naturally, exactly-once is the gold standard of delivery guarantees provided by the engine and is harder to achieve.
Exactly-once semantics guarantee that each event is processed only once, even in the case of failures or retries, ensuring data consistency. The diagram below illustrates how Flink manages end-to-end exactly-once semantics.
A more recent development is the ability to query the state store directly. Flink and Kafka Streams’ interactive query features allow you to leverage the state of your application from outside your application.
To get the full state of your application, you must connect the various fragments of the state. You’ll need to:
Flink’s queryable state: Apache Flink’s queryable state allows external applications to directly query the state of a running Flink job in real time. This feature is crucial for low-latency decision-making use cases such as fraud detection, anomaly detection, and real-time monitoring.
Kafka Streams’ interactive queries: Similar to Flink, interactive queries with Kafka Streams enable real-time lookups into streaming state, allowing applications to query and retrieve live results without external storage.
In a perfect world, all events in a stream would arrive in order, right after they occur. But in reality, network delays, retries, batching, and system lag often cause events to arrive late or out of order. That’s where watermarks come in. A watermark is a mechanism used in event-time stream processing to track the progress of time in an unbounded, asynchronous stream.
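A common strategy, bounded out-of-orderness, can be sketched in a few lines: the watermark trails the largest event time seen so far by a fixed allowed lateness, and once it passes a window’s end, that window can safely close. (The constant and timestamps here are illustrative.)

```python
# Sketch of a bounded-out-of-orderness watermark: it trails the maximum
# observed event time by a fixed lateness bound and never moves backwards.
ALLOWED_LATENESS = 5  # seconds

max_event_time = float("-inf")

def observe(event_time):
    global max_event_time
    max_event_time = max(max_event_time, event_time)
    return max_event_time - ALLOWED_LATENESS  # current watermark

watermarks = [observe(t) for t in [100, 104, 102, 110]]
print(watermarks)  # [95, 99, 99, 105]
```

Note that the out-of-order event (102) doesn’t drag the watermark backwards; it simply arrives while the watermark still leaves its window open.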
Without watermarks, your metrics may end up being inaccurate because:
Essentially, watermarks allow stream processors to:
In stream processing, data is often infinite and continuous, but to analyze or aggregate it, you need to group it into finite chunks. That’s what windowing is all about.
Windowing lets you break a stream into time-based or event-driven slices, thus enabling operations such as “Total sales every 5 minutes," “Average session time per user," and “Top N products viewed in the last hour." Without windowing, it’s nearly impossible to compute meaningful aggregations or time-bound insights on an unbounded stream.
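The “total sales every 5 minutes” example maps directly to tumbling (fixed, non-overlapping) windows, sketched here with illustrative timestamps:

```python
# Sketch of tumbling-window aggregation: each event is assigned to the
# fixed 5-minute slice containing its timestamp, and amounts are summed.
from collections import defaultdict

WINDOW = 300  # 5 minutes, in seconds

totals = defaultdict(float)

def add_sale(ts, amount):
    window_start = ts - (ts % WINDOW)   # align down to the window boundary
    totals[window_start] += amount

for ts, amount in [(10, 5.0), (120, 7.5), (301, 2.0), (599, 3.0)]:
    add_sale(ts, amount)

print(dict(totals))  # {0: 12.5, 300: 5.0}
```

Sliding and session windows follow the same idea with different assignment rules: sliding windows overlap, and session windows are bounded by gaps of inactivity rather than fixed boundaries.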
Now, if you’ve ever wondered why Flink handles stateful operations so well, or how Spark’s micro-batching affects performance, this is where it all clicks.
In this blog, I divide features into categories, describe what each feature should do, and then rate each streaming engine with an A, B, or C based on how well it delivers on that feature: A = best solution, B = good solution, C = barely there. After the feature-by-feature analysis, I also created a metascore, awarding 3 points for every A, 2 for a B, and 1 for a C. The metascore is a normalized sum of all points across all features and categories. Below are the rankings from the metascore:
Is this a perfect, scientifically measured score? No. For example, perhaps not all features should be given equal weight. Is it possible to create an objectively perfect score? I don’t think so; these are not products where you can run some queries, time them, and chalk up a TPC-DS result. As mentioned before, the features that matter to you might not matter to someone else. So take my analysis as a starting point for your own research. A few of the ratings below could be read as subjective; if you disagree with one, reach out to me. I would love to learn from your perspective and perhaps adapt my conclusions.
Let’s get into the guts of it, with our first comparison table.
🏆 Best in class: Apache Flink
❌ Worst in class: Structured Streaming
From a dev-ex standpoint, Spark Structured Streaming offers the most APIs and is compatible with the widely popular Apache Spark™ ecosystem supporting SQL, Python, Scala, and Java, and several other libraries. While Flink follows closely with similar support in terms of programming languages, Kafka Streams lags with only Java and Scala support.
Monitoring and observability are also key differentiators; Flink and Spark provide detailed UI dashboards and native observability integrations, whereas Kafka Streams relies on Confluent Monitoring for visibility.
For teams looking for rapid development, minimal setup, and deep ecosystem integration, Spark Structured Streaming remains the top choice. Flink is ideal for performance-driven applications that require fine-grained control, while Kafka Streams is best for lightweight, event-driven Kafka-native microservices.
🏆 Best in class: Apache Spark
❌ Worst in class: Kafka Streams