Do you need millisecond-level event processing, or is a near real-time pipeline sufficient? Should you prioritize scalability and throughput, or do you require deep integration with a data lakehouse? And most importantly, which engine will scale efficiently as your data volume grows?
Let me break it to you in just the second paragraph: it all depends on your use case. That said, choosing an engine that can efficiently handle growing data volumes, with the right balance of features for your workload, is critical.
The goal of this blog is to unpack the definition of streaming, address common misconceptions, explore the internals of stream processing engines, and compare the trade-offs in their designs and how they impact streaming data pipelines. More importantly, we’ll highlight why relying on a single engine for all your needs may not be enough—as workloads evolve, leveraging multiple engines can provide the flexibility needed to optimize for both performance and cost. Maybe, just maybe, streaming is just a faster batch pipeline. Let’s find out!
If you are an engineer already well-versed in stream processing, feel free to scroll down right to the comparison tables.
A streaming engine is a specialized data processing system designed to handle continuous streams of data in real-time or near real-time. These engines power critical applications where low latency, high throughput, and scalability are key, such as fraud detection in financial transactions and real-time recommendations in e-commerce.
Data processing architectures have evolved significantly over time:
When discussing streaming engines, you’ll often hear the terms “real-time,” “near real-time,” or “micro-batch” - each defining a different level of latency and processing approach. At Onehouse, we deliver a related, but different, incremental processing model.
Before diving further into stream processing engines, let's step back and look at the broader data landscape.
Open source engines such as Apache Flink, Apache Beam, Apache Samza, Spark Structured Streaming, and Apache Storm enable organizations to build custom real-time processing solutions. Managed offerings such as Ververica, Confluent, StreamNative, and Databricks provide fully managed solutions for Flink, Kafka Streams, Pulsar, and Spark, respectively, making it easier to scale streaming workloads without operational overhead.
Streaming engines can be broadly categorized based on their processing models:
Real-time and near real-time engines process each event as it arrives with minimal delay, typically in milliseconds to seconds. Example use cases include fraud detection in banking transactions.
In this blog, we cover two such engines in depth, namely Apache Flink and Kafka Streams. We chose these engines because of their popularity in the data processing ecosystem.
Apache Flink, initially released in May 2011, is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments and perform computations at in-memory speed at scale. Flink provides a high-throughput, low-latency streaming engine and support for event-time processing and state management. It was developed for streaming-first use cases and later added a unified programming interface for both stream and batch processing.
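For orientation, here is a minimal sketch of what a Flink DataStream job looks like in Java, reading from a Kafka topic and applying a simple stateless transformation. The topic name, bootstrap servers, and class name are illustrative, not from any particular production setup.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ClickFilterJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Read an unbounded stream of click events from a (hypothetical) Kafka topic
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("clicks")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "clicks-source")
           .filter(line -> !line.isEmpty())   // a trivial stateless transformation
           .print();                          // sink to stdout, just for illustration

        env.execute("click-filter-job");
    }
}
```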
Kafka Streams (added in Kafka’s 0.10.0.0 release) is a lightweight stream-processing library for building applications and microservices, where input and output data are stored in Kafka clusters. It combines the simplicity of writing and deploying standard Java and Scala applications on the client side with the benefits of Kafka’s server-side cluster technology. Kafka Streams has a low barrier to entry as it lets you quickly write and run a small-scale application on a single machine and only requires running additional instances of the application on multiple machines to scale up to high-volume production workloads. The key design choice in Kafka Streams is that the input and output data is always a Kafka topic.
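To illustrate that design choice, here is a hedged sketch of a small Kafka Streams application that reads from one Kafka topic and writes to another. The topic names, application id, and filter predicate are illustrative.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class OrderFilterApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-filter-app"); // also prefixes local state
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Input and output are always Kafka topics
        KStream<String, String> orders = builder.stream("orders");
        orders.filter((key, value) -> value != null && value.contains("\"status\":\"PAID\""))
              .to("paid-orders");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Scaling up is then a matter of starting more instances of this same application; Kafka partition assignment spreads the work across them.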
Micro-batch engines lie in the middle of streaming and batch processing engines, dealing with small, frequent batches that approximate streaming behavior but are processed in intervals. In this blog, we will cover Spark’s Structured Streaming in depth and also see how it compares with the real-time streaming engines.
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. It lets users express a streaming computation the same way they would express a batch computation on static data. The Spark SQL engine takes care of running it incrementally and continuously, updating the final result as streaming data continues to arrive. Users can express streaming aggregations, event-time windows, stream-to-batch joins, and more using the Dataset/DataFrame API in Scala, Java, Python, or R.
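As a rough sketch of that batch-like programming model, the Java example below reads a Kafka topic as an unbounded table and maintains a running count. The topic, servers, and output mode are illustrative, and the Kafka source assumes the spark-sql-kafka connector is on the classpath.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import static org.apache.spark.sql.functions.col;

public class EventCounts {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("event-counts").getOrCreate();

        // Treat the Kafka topic as an unbounded table
        Dataset<Row> events = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "events")
                .load();

        // A batch-style aggregation, run incrementally by the engine
        Dataset<Row> counts = events
                .selectExpr("CAST(value AS STRING) AS event")
                .groupBy(col("event"))
                .count();

        StreamingQuery query = counts.writeStream()
                .outputMode("complete")   // emit the full updated result each trigger
                .format("console")
                .start();
        query.awaitTermination();
    }
}
```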
Knowing what these tools are is just the beginning of the journey. In this section, we dig deeper into the core concepts of stream processing, given that this is still a relatively new and fast-growing technology category.
You’ll learn:
At its core, stream processing isn’t just about reacting to individual events; it’s about understanding those events in context. And context requires state.
Stateful processing allows a streaming engine to remember information across events. Without it, each incoming data point is treated in isolation, making it impossible to perform essential operations such as:
In other words, state is what turns raw events into meaningful insights.
Stateful processing is critical for remembering information across events. But to remember, you need the state to be stored somewhere (even if it’s temporary). And that’s where the state store comes in. Think of a state store as the internal memory of a stream processing job.
It holds intermediate values like:
In systems such as Kafka Streams, each application instance maintains a local state store, usually backed by RocksDB or in-memory hash maps. Structured Streaming supports stateful processing through the HDFS state store provider or RocksDB. Flink uses the term State Backend for the component that manages how state is stored, accessed, and recovered during stream processing, and it provides two built-in state backends: HashMapStateBackend and EmbeddedRocksDBStateBackend.
For production workloads where durability, failure recovery, low latency, and scalability are key, choose a persistent state store (or backend) such as RocksDB to unlock durability and scale.
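In Flink, for example, that choice is a small configuration step on the execution environment. The sketch below assumes Flink 1.13+ (where EmbeddedRocksDBStateBackend is available) and uses an illustrative S3 bucket for durable checkpoint storage.

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDBBackendConfig {
    public static StreamExecutionEnvironment configure() {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Keep keyed state in RocksDB on local disk so state can grow beyond available memory
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true)); // true = incremental checkpoints
        // Durable checkpoint storage so state survives failures (bucket path is illustrative)
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/flink-checkpoints");
        return env;
    }
}
```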
Here’s a simple representation of how you can use Flink to query the state:
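The sketch below shows one way to do it: a keyed function that reads and updates a per-key counter held in ValueState. The event and state types are illustrative.

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Counts events per key, remembering the running total in keyed state
public class EventCounter extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {
    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("event-count", Long.class));
    }

    @Override
    public void flatMap(Tuple2<String, Long> event, Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = count.value();               // query the state for the current key
        long updated = (current == null ? 0L : current) + 1;
        count.update(updated);                      // write the new value back to the state store
        out.collect(Tuple2.of(event.f0, updated));
    }
}
```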
While state stores (or backends) manage where and how state is held during processing, checkpointing ensures that state and stream progress are not lost in the event of a failure.
Checkpointing is a mechanism for periodically saving a snapshot of the application’s state and its position in input sources at a point in time. If the application crashes, it can recover by restoring from the last checkpoint, resuming processing without data loss or duplication.
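In Flink, for instance, enabling checkpointing takes only a little configuration. The sketch below snapshots state and source offsets every 60 seconds in exactly-once mode; the interval and pause values are illustrative.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingConfig {
    public static StreamExecutionEnvironment configure() {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Snapshot application state and source positions every 60 seconds
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
        // Give the job breathing room between consecutive checkpoints
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(10_000);
        return env;
    }
}
```

If the job crashes, Flink restores the most recent completed checkpoint and resumes from the recorded source positions.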
Here’s how different engines implement checkpointing and state recovery:
The combination of stateful processing, durable state stores, and checkpointing provides robust fault tolerance. It ensures that even if a node or job fails:
Flink and Spark rely on checkpointing + distributed snapshots, while Kafka Streams relies on changelog replication to persist and recover local state.
In batch processing, time isn’t a major consideration: you're working with static data. But in stream processing, time is dynamic, and events can arrive late (due to network delays, retries, etc.), out of order, in real time, or after the fact (backfills/corrections).
This is where time semantics come in. Choosing the right time model, i.e., event time or processing time, impacts everything from accuracy to latency to correctness of results.
For example, imagine you're aggregating user activity into hourly windows. If you use processing time, a late event might fall into the wrong window—or be dropped altogether. With event time, the system can place the event in the correct window, even if it arrives late.
In stream processing, data doesn’t always flow smoothly. Sometimes, producers generate events faster than the system can process them; this mismatch is known as backpressure. Without proper handling, it leads to memory overflows, increased latency, or, worse, job crashes.
In other words, backpressure is what keeps your pipeline from cracking under pressure.
Delivery guarantee semantics define the level of assurance provided by a messaging system or delivery protocol. These guarantees cover message ordering (in delivery and processing), delivery reliability, tolerance for duplicates, and so on. In other words, delivery semantics determine exactly how each message is handled on its way through the system.
There are three types of delivery guarantees: at-most-once, at-least-once, and exactly-once.
If you’d like to read more about each of these guarantees with examples, this blog covers them well. Naturally, exactly-once is the gold standard of delivery guarantees and the hardest for an engine to achieve.
Exactly-once semantics guarantee that each event is processed only once, even in the case of failures or retries, ensuring data consistency. The diagram below illustrates how Flink manages end-to-end exactly-once semantics.
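Flink opts into this guarantee via the EXACTLY_ONCE checkpointing mode shown earlier; in Kafka Streams it is a single configuration property. The hedged sketch below uses an illustrative application id and servers, and EXACTLY_ONCE_V2 assumes a recent Kafka version.

```java
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class ExactlyOnceConfig {
    public static Properties props() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payments-app");      // illustrative
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative
        // Wrap the consume-process-produce cycle in Kafka transactions so each
        // input record affects the output and changelog topics exactly once
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
        return props;
    }
}
```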
A more recent development is the ability to query the state store directly. Flink’s and Kafka Streams’ interactive query features let you leverage your application’s state from outside the application.
To get the full state of your application, you must connect the various fragments of the state. You’ll need to:
Flink’s queryable state: Apache Flink’s queryable state allows external applications to directly query the state of a running Flink job in real time. This feature is crucial for low-latency decision-making use cases such as fraud detection, anomaly detection, and real-time monitoring.
Kafka Streams’ interactive queries: Similar to Flink, interactive queries with Kafka Streams enable real-time lookups into streaming state, allowing applications to query and retrieve live results without external storage.
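As a hedged sketch, querying a local key-value store from a running Kafka Streams instance looks roughly like this; the store name "counts-store" and the key/value types are illustrative.

```java
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class StateLookup {
    // 'streams' is assumed to be a running KafkaStreams instance whose topology
    // materialized a store named "counts-store"
    public static Long lookup(KafkaStreams streams, String key) {
        ReadOnlyKeyValueStore<String, Long> store = streams.store(
                StoreQueryParameters.fromNameAndType(
                        "counts-store", QueryableStoreTypes.<String, Long>keyValueStore()));
        return store.get(key);   // read the live, locally held value for this key
    }
}
```

Note that each instance only holds its own partitions of the state; serving the full picture requires routing lookups to the instance that owns the key, as described above.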
In a perfect world, all events in a stream would arrive in order, right after they occur. But in reality, network delays, retries, batching, and system lag often cause events to arrive late or out of order. That’s where watermarks come in. A watermark is a mechanism used in event-time stream processing to track the progress of time in an unbounded, asynchronous stream.
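In Flink, for example, a watermark strategy is typically declared alongside the source. The sketch below uses a hypothetical ClickEvent type and tolerates events arriving up to 30 seconds out of order.

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

public class Watermarks {
    // Hypothetical event carrying its own occurrence time
    public static class ClickEvent {
        public long timestampMillis;
    }

    // Advance event time based on each event's own timestamp,
    // allowing events to be up to 30 seconds out of order
    public static WatermarkStrategy<ClickEvent> strategy() {
        return WatermarkStrategy
                .<ClickEvent>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                .withTimestampAssigner((event, recordTimestamp) -> event.timestampMillis);
    }
}
```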
Without watermarks, your metrics may end up being inaccurate because
Essentially, watermarks allow stream processors to:
In stream processing, data is often infinite and continuous, but to analyze or aggregate it, you need to group it into finite chunks. That’s what windowing is all about.
Windowing lets you break a stream into time-based or event-driven slices, thus enabling operations such as “Total sales every 5 minutes," “Average session time per user," and “Top N products viewed in the last hour." Without windowing, it’s nearly impossible to compute meaningful aggregations or time-bound insights on an unbounded stream.
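For instance, “Total sales every 5 minutes” maps naturally onto a tumbling window. The Kafka Streams sketch below counts events per key in 5-minute windows; the topic, store name, and the choice to count events rather than sum amounts are illustrative simplifications, and ofSizeWithNoGrace assumes a recent Kafka Streams version.

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;

public class SalesPerWindow {
    public static KTable<Windowed<String>, Long> build(StreamsBuilder builder) {
        // Count sales events per product key in tumbling 5-minute windows
        return builder.stream("sales", Consumed.with(Serdes.String(), Serdes.String()))
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
                .count(Materialized.as("sales-per-5min"));
    }
}
```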
Now, if you’ve ever wondered why Flink handles stateful operations so well, or how Spark’s micro-batching affects performance, this is where it all clicks.
In this blog, I divide features into categories, describe what each feature should do, and then rate each streaming engine with an A, B, or C based on how well it delivers on that feature. Each feature is unique, and I roughly assign the letters as follows: A = best solution, B = good solution, C = barely there. After the one-by-one analysis, I also computed a metascore, awarding 3 points for every A, 2 for a B, and 1 for a C. The metascore is a normalized sum of all points across all features and categories. Below are the rankings from the metascore:
Is this a perfectly scientific score? No. For example, perhaps not all features should be given equal weight. Is it possible to create an objectively perfect score? I don’t think so; these are not products where you can run some queries, count the seconds, and chalk up a TPC-DS number. As mentioned before, the features that matter to you might not matter to someone else, so take my analysis as a starting point for your own research. A few of the ratings below could be seen as subjective; if you disagree with a rating, reach out to me. I would love to learn from your perspective and perhaps adapt my conclusions.
Let’s get into the guts of it, with our first comparison table.
🏆 Best in class: Apache Flink
❌ Worst in class: Structured Streaming
From a dev-ex standpoint, Spark Structured Streaming offers the most APIs and is compatible with the widely popular Apache Spark ecosystem, supporting SQL, Python, Scala, Java, and several other libraries. Flink follows closely with similar programming-language support, while Kafka Streams lags with only Java and Scala support.
Monitoring and observability are also key differentiators; Flink and Spark provide detailed UI dashboards and native observability integrations, whereas Kafka Streams relies on Confluent Monitoring for visibility.
For teams looking for rapid development, minimal setup, and deep ecosystem integration, Spark Structured Streaming remains the top choice. Flink is ideal for performance-driven applications that require fine-grained control, while Kafka Streams is best for lightweight, event-driven Kafka-native microservices.
🏆 Best in class: Apache Spark
❌ Worst in class: Kafka Streams
When it comes to measuring performance, we don’t assume that every streaming benchmark accurately reflects real-world scenarios, but for what it’s worth, Flink appears to outperform Spark and Kafka Streams in the Yahoo benchmark after correcting for bugs and configuration issues. We have also cited some streaming vendors here that don’t build on any of our three engines. While we cannot vouch for the performance claims about their own streaming engines, we thought they would baseline our three engines relatively neutrally.
With the disclaimers out of the way: reading across several benchmarks, Apache Flink generally stands out as the best of the three, delivering low-latency, high-throughput event processing with fine-grained resource management and auto-scaling capabilities. Flink’s adaptive scheduler and auto-tuning features make it particularly well-suited for large-scale, real-time applications, ensuring efficient CPU and memory utilization while dynamically adjusting to workload demands.
Generally, while Kafka Streams benefits from Kafka’s inherent scalability, it lacks native dynamic load balancing, requiring additional operational overhead to balance load across application instances. Kafka Streams also ranked weakest in Databricks’ benchmark, in part because of how that benchmark was set up, relying on Kafka for data generation.
Spark offers dynamic load balancing, but it is still bound by the micro-batch model, which can introduce higher latency and performance bottlenecks under certain workloads. According to this benchmark, Spark Structured Streaming ranks weakest for latency across the throughput spectrum.
For teams prioritizing ultra-low latency, real-time responsiveness, and efficient resource management, Flink is the superior choice. Kafka Streams remains best suited for lightweight, Kafka-native workloads, whereas Spark Structured Streaming is ideal for batch-heavy, analytics-driven use cases that can tolerate comparatively higher latencies.
One day, maybe we will do our own benchmarks and get to the truth. Until then, test your own actual workloads!
🏆 Best in class: Apache Flink
❌ Worst in class: Kafka Streams
When it comes to ecosystem integration, Spark Structured Streaming is best-in-class, offering deep integration with data catalogs, broad cloud vendor support, and compatibility with all major file and lakehouse table formats. Spark, and by extension Structured Streaming, seamlessly connects with Hive, Glue, Unity Catalog, and Polaris, making it the preferred choice for lakehouse architectures and analytics-driven workloads. It is also supported across AWS, GCP, Azure, and Databricks, providing first-class cloud integration out of the box.
On the other hand, Kafka Streams ranks the weakest due to its limited native catalog support, lack of direct file format integration, and dependency on Kafka-native serialization formats (Avro, JSON, Protobuf). Unlike Flink and Spark, which support multiple lakehouse table formats (Delta, Iceberg, Hudi), Kafka Streams requires additional components (e.g., Confluent Tableflow) to integrate with modern data lakes, adding complexity. It is also primarily supported in Confluent Cloud, lacking the same level of vendor-backed deployment options as Flink and Spark.
Apache Flink sits in the middle, providing decent catalog and cloud support, but lacking the full-fledged lakehouse capabilities of Spark. While Flink supports Hive, JDBC, and custom user-defined catalogs, it is not as tightly integrated with enterprise catalog solutions such as AWS Glue Data Catalog or Unity Catalog. Flink’s broad file format support (Parquet, Avro, ORC, JSON, CSV) and compatibility with Delta, Iceberg, and Hudi make it a strong contender for real-time streaming into lakehouse architectures, but its catalog integration could be more seamless.
For teams focused on deep ecosystem integration, seamless lakehouse compatibility, and analytics workflows, Spark Structured Streaming is the clear winner. Flink remains a solid choice for high-performance streaming with broad file format compatibility, while Kafka Streams is best suited for Kafka-native applications but falls short in broader data ecosystem integration.
🏆 Best in class: Apache Spark
❌ Worst in class: Kafka Streams
Even after unpacking all the technical details hidden under the depths of Google, assessing community momentum is inherently subjective; it varies by use case, ecosystem, and even search terms.
The stats needed to rank community momentum or adoption are not openly available. For example, both the Structured Streaming and Kafka Streams projects live inside the Spark and Kafka GitHub repositories, so tracking the number of stars at the streaming-project level is impossible. Therefore, we decided to rank each engine categorically rather than declare an overall winner and loser.
We recommend you work from the top down and choose the right engine for the job at hand.
For low-latency, high-throughput requirements such as fraud detection, IoT, and other scenarios where real-time analytics is critical, combined with checkpointing, failure recovery, and batch-stream unification, we recommend Flink. Flink’s event-driven architecture, incremental state snapshots, and fine-grained resource management make it superior for high-scale, low-latency workloads. It can dynamically scale based on demand, efficiently handle stateful workloads, and process millions of events per second with sub-millisecond latency. Unlike Kafka Streams, which is tied to Kafka partitions, and Spark Structured Streaming, which relies on micro-batches, Flink continuously processes data without batch delays.
On the other hand, Kafka Streams is a better solution for event-driven microservices. It works well for real-time transformations, stateless processing, and simple aggregations without requiring extra infrastructure.
If real-time ETL is not a priority, Structured Streaming could be ideal because of its batch-like stream processing capabilities. It supports SQL-based transformations and offers rich ecosystem integration. And if you are already using Spark for batch processing, your team has one less framework to learn.
No matter which engine you choose, the true power comes from being able to mix and match engines, scale with confidence, and take advantage of ecosystem efficiencies, all without being locked into a single architecture.
That’s where Onehouse comes in. At Onehouse, we believe in open data, open engines, and open possibilities. Our platform brings:
With Onehouse, not only do you get fully managed ingestion, incremental ETL, and table optimizations, but with our newly released Open Engines™ platform you can now launch clusters with a variety of open source engines at the click of a button:
Ready to tackle your analytics challenges? We would love to chat! Contact Onehouse at gtm@onehouse.ai to get the details behind how we did these comparisons and how you can achieve 2-30x performance gains seamlessly running Spark, Trino, Presto, StarRocks, or ClickHouse.