January 30, 2025

Ingestion at Scale: Comparing Onehouse, Fivetran, and Airbyte for Cost and Performance

Introduction: What You Need for Data Ingestion

In today’s cloud-first data era, data ingestion has become a critical yet complex process for enterprises. With hundreds of data sources—SaaS applications, operational databases, IoT devices, log streams, and more—organizations face an ever-growing challenge of efficiently ingesting and centralizing data. Adding to this complexity are diverse data formats (structured, semi-structured, and unstructured), open table formats such as Apache Hudi™, Apache Iceberg™, and Delta Lake, and destinations that include data warehouses, data lakes, and modern data lakehouses.

In this space, Fivetran and Airbyte have emerged as two popular data ingestion solutions, enabling organizations to connect data sources to their data platforms with ease. Onehouse is a newer product on the market, offering data ingestion purpose-built for modern data lakehouses and leveraging open table formats such as Apache Hudi, Apache Iceberg, and Delta Lake to provide robust ingestion capabilities.

In this blog, we compare Fivetran, Airbyte, and Onehouse on cost/performance at different data scales to help you choose the best solution for your data ingestion and analytics needs. These are the core dimensions we will dive deep into:

  1. Pricing Model: These solutions provide pricing options based on various metrics. From the customer’s perspective, transparency and simplicity in pricing are essential.
  2. Cost/Scale: Cloud data ingestion can become costly if not optimized for scale and usage. An ideal solution minimizes operational and infrastructure costs while efficiently handling high volumes of data.
  3. Performance: Ingesting data quickly is essential for powering near real-time analytics and decision-making. Organizations cannot afford delays in critical insights, especially when dealing with streaming data sources or frequent batch updates.
  4. Use Cases: A flexible data ingestion tool seamlessly supports a variety of use cases - batch ingestion, near real-time streaming, and change data capture (CDC) while accommodating diverse data sources and destinations.

Pricing Model

Fivetran, Airbyte, and Onehouse all provide data ingestion capabilities, but their distinct pricing models can lead to substantial variations in the total cost of your solution.

Row-Based Pricing

  • Fivetran employs a row-based pricing model using Monthly Active Rows (MARs), which represent the unique rows added, updated, or deleted in the destination during a given month.
  • Pros:
    • MAR is counted only once per row per month, regardless of how many updates are made.
    • Pricing is independent of the number of connectors used.
  • Cons:
    • Costs escalate rapidly with increasing data volumes.
    • The pricing structure is complex, with multiple models and parameters, making cost prediction challenging.

Volume-Based Pricing

  • Airbyte employs a volume-based pricing model, meaning you pay based on the size in bytes replicated from database sources. 
  • Pros:
    • Airbyte doesn’t charge for failed syncs or normalization processes.
    • Volume discounts are available for larger customers.
  • Cons:
    • Costs increase linearly with the data volume processed.
    • Multiple pricing options complicate planning and decision-making.

Compute-Based Pricing

  • Onehouse uses a compute-based pricing model measured in Onehouse Compute Units (OCUs), which represent normalized compute resources. Billing is based on the total OCU-hours consumed, regardless of volume or rows, aligning costs closer to the amount of work performed.
  • Pros:
    • Simple and transparent with no additional costs for data volume or storage.
    • Sub-linear price scaling as a result of efficient resource sharing between ingestion, transformation, and optimization workloads.
  • Cons:
    • Costs are harder to estimate upfront as they depend on runtime and resource allocation.
    • Designed for optimum efficiency at medium to large scale data volumes.

Onehouse’s compute-based pricing is the fairest and most flexible of the three models for modern workloads. By tying pricing to actual compute usage rather than proxies like row counts or bytes replicated, it aligns costs with the work performed and adapts to customers’ varying workload demands.
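To make the three models concrete, each can be sketched as a simple cost function. All rates below are illustrative placeholders for comparison only, not published vendor prices:

```python
# Illustrative cost functions for the three pricing models.
# All rates are hypothetical placeholders, not vendor list prices.

def row_based_cost(monthly_active_rows, usd_per_million_mars=90.0):
    """Row-based (Fivetran-style): pay per unique row touched per month."""
    return monthly_active_rows / 1_000_000 * usd_per_million_mars

def volume_based_cost(gb_replicated, usd_per_gb=10.0):
    """Volume-based (Airbyte-style): pay per GB replicated."""
    return gb_replicated * usd_per_gb

def compute_based_cost(ocu_hours, usd_per_ocu_hour=0.50):
    """Compute-based (Onehouse-style): pay per normalized compute-hour consumed."""
    return ocu_hours * usd_per_ocu_hour

# A workload of 100M changed rows at ~1 KB/row is roughly 100 GB/month.
print(row_based_cost(100_000_000))  # 9000.0
print(volume_based_cost(100))       # 1000.0
print(compute_based_cost(500))      # 250.0
```

The key structural difference: the first two functions grow with data size no matter how cheaply the work could be done, while the compute-based function grows only with the resources the pipeline actually consumes.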

Performance

Fivetran

Fivetran is a proprietary enterprise tool for database extraction and ingestion, and it offers two primary solutions:

  1. Fivetran Native Connector: Best suited for low-volume replication, typically handling fewer than 100 million MARs.
  2. HVR (High-Volume Replicator): Designed for large-scale replication, capable of handling databases with 100 million MARs or more.

Fivetran HVR is a self-hosted, enterprise-grade solution from Fivetran that supports high-volume data replication. In this blog, we focus on use cases involving large databases with over 100 million MARs (roughly 100GB/month), using Fivetran's HVR solution. Reported Fivetran data ingestion throughputs range from 7MB/sec to 16MB/sec.

HVR optimizes data transport by capturing only relevant changes, compressing them at the source, and transferring the data in a compressed format to the HVR hub. Decompression occurs just before delivery to the target, leveraging high-speed native technologies like clustered file systems and staging tables for maximum performance. Its micro-batch (burst) approach stages data and uses set-based SQL for efficient, near real-time merging into the target.

However, Fivetran's HVR solution requires the installation of custom agents on source databases and, in some cases, even target destinations. This setup introduces potential security risks and operational complexities.

Fivetran is primarily designed for batch data replication and ingestion, operating at sync frequencies of several minutes or longer. While effective for periodic data transfers, it has notable speed limitations. Achieving faster ingestion performance relies heavily on the vertical scaling of the HVR server, which can be resource-intensive and costly. It struggles to meet near real-time Service Level Agreements (SLAs), such as those requiring data synchronization at the sub-minute level. 

This limitation stems from the company’s strategic focus on batch processing rather than real-time or streaming use cases. Real-time data replication presents unique challenges, including ensuring low latency, maintaining high data quality, and achieving consistent data synchronization—all areas where Fivetran has chosen not to specialize.

Airbyte

Airbyte offers both an open-source version and a fully managed cloud-hosted solution, giving customers the flexibility to choose the deployment model that best suits their needs. Customers can deploy and manage the open-source version independently or opt for a hybrid model by purchasing support from Airbyte. Additionally, Airbyte provides a fully managed service called Airbyte Cloud, which will be the primary focus of this blog.

Airbyte is designed to streamline API automation and simplify connections to a wide variety of data sources. It is particularly well-suited for small to medium-scale database replication use cases, typically handling a few hundred GB per month with latency SLAs of one hour or more. While Airbyte can efficiently address many common data integration requirements, its performance may not be ideal for high-volume, low-latency scenarios that demand real-time data replication or processing at a sub-minute level. Reported Airbyte data ingestion throughputs range from 7MB/sec to 14MB/sec.

Onehouse

Onehouse pioneers the universal data lakehouse architecture, enabling organizations to move beyond disjointed legacy systems and adopt a modern, unified approach. With its innovative architecture, Onehouse provides customers with complete control over their data and compute resources. By running the data plane entirely within the customer’s cloud account and VPC, Onehouse ensures that data remains secure and never leaves the customer’s environment.

The Onehouse Compute Runtime (OCR) is the first compute runtime to seamlessly support all table formats, query engines, and catalogs, powering Onehouse workloads like ingestion, incremental ETL, and table optimizations. Fully compatible with open source Apache Spark™, OCR incorporates advanced technologies like Spark multiplexing, vectorized columnar merging, and optimized Parquet writers to enhance throughput and reduce processing time. By minimizing network requests compared to open source Parquet readers, OCR ensures faster and more efficient data access. Built on three pillars—Adaptive Workload Optimizer, Serverless Compute Manager, and High-Performance Lakehouse I/O—OCR delivers unparalleled performance and cost efficiency for modern lakehouse workloads.

Onehouse's platform supports horizontal and auto-scaling, with throughput varying based on workload type (append-only vs. mutable). Average throughput scales with cluster size, reaching 50MB/sec to over 100MB/sec for medium to large clusters.
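These throughput figures translate directly into wall-clock ingestion time. A back-of-envelope calculation (assuming sustained rates, which real pipelines rarely achieve continuously) shows why the difference matters at scale:

```python
def ingest_hours(gb_per_day, mb_per_sec):
    """Hours of wall-clock ingestion time for a daily volume at a sustained throughput."""
    return gb_per_day * 1024 / mb_per_sec / 3600

# 1 TB/day at the throughput ranges reported above:
print(round(ingest_hours(1024, 10), 1))   # 29.1 -- at 10 MB/s, a day's data takes >24h to ingest
print(round(ingest_hours(1024, 100), 1))  # 2.9  -- at 100 MB/s the same volume fits comfortably
```

At around 10MB/sec, a terabyte per day simply cannot keep up without parallelization; at 100MB/sec it leaves ample headroom for near real-time SLAs.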

Cost and Scale

Fivetran

Fivetran’s pricing model is based on Monthly Active Rows (MARs), a metric that is not only complex to understand but also difficult to estimate in advance. While this model works well for small data volumes, such as those typically generated by SaaS applications, it becomes prohibitively expensive for large-scale operations.

At larger data scales, scalability and cost challenges with Fivetran have been reported in the public domain. Fivetran's HVR solution claims to handle medium to large-scale data ingestion scenarios exceeding 100GB/day, but there are few publicly available references to substantiate this claim.

Modern data lakehouses often need to replicate significant volumes of data—ranging from 10GB to thousands of GB per day—from operational databases. In such cases, Fivetran’s MAR-based pricing can escalate quickly. For example, replicating a medium-sized PostgreSQL database with 100 million MARs per month would cost approximately $9,350/month on Fivetran’s standard plan, assuming 1000 bytes per row. This pricing structure makes Fivetran less competitive for high-volume, enterprise-scale data replication use cases. 
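Taking the quoted $9,350/month figure as given, a quick calculation shows the effective rates implied by this example (the derived per-GB and per-million-MAR rates are back-calculations, not Fivetran list prices):

```python
rows_per_month = 100_000_000  # 100 million monthly active rows
bytes_per_row = 1000
quoted_cost = 9350            # USD/month, from the example above

gb_per_month = rows_per_month * bytes_per_row / 1024**3
usd_per_gb = quoted_cost / gb_per_month
usd_per_million_rows = quoted_cost / (rows_per_month / 1_000_000)

print(round(gb_per_month, 1))          # 93.1  -> ~93 GB of changed data per month
print(round(usd_per_gb, 2))            # 100.39 -> over $100 per effective GB
print(round(usd_per_million_rows, 2))  # 93.5   -> $93.50 per million MARs
```

At over $100 per effective GB of changed data, it is easy to see why MAR-based pricing becomes uncompetitive for update-heavy, high-volume replication.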

Airbyte

Airbyte Cloud uses a volume-based pricing model that depends on the total amount of data replicated (for database sources) and the number of rows replicated (for API sources). According to Airbyte’s cost calculator, their pricing starts at $10/GB, with costs per GB decreasing as monthly data volumes exceed 250GB. This tiered pricing structure makes Airbyte an appealing choice for organizations managing small to moderate data volumes, offering cost savings as usage scales. 
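This tiered structure can be sketched as a simple calculator. The $10/GB base rate and 250GB threshold come from the description above; the $6/GB discounted rate is an assumption (the same rate used for extrapolation in Figure 1), as Airbyte's exact tier schedule is not public:

```python
# Sketch of tiered volume-based pricing: $10/GB up to 250 GB/month,
# then an assumed discounted rate beyond the threshold.

def airbyte_style_cost(gb_per_month, base_rate=10.0, discount_rate=6.0, threshold_gb=250):
    """Monthly cost in USD under a two-tier volume pricing model."""
    if gb_per_month <= threshold_gb:
        return gb_per_month * base_rate
    # First 250 GB at the base rate, the remainder at the discounted rate.
    return threshold_gb * base_rate + (gb_per_month - threshold_gb) * discount_rate

print(airbyte_style_cost(100))   # 1000.0
print(airbyte_style_cost(1000))  # 7000.0 = 250*10 + 750*6
```

Even with the volume discount, cost still grows roughly linearly with bytes replicated, which is the core limitation of this model at high scale.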

Airbyte faces notable scaling challenges, particularly due to the lack of built-in auto-scaling. According to their documentation, scaling Airbyte requires ensuring the underlying Kubernetes cluster has enough resources to schedule its job pods. Users must manually manage cluster capacity to handle the number of pods Airbyte creates. This process involves starting with mid-sized cloud instances (e.g., 4 or 8 cores) and manually adjusting instance sizes for workload demands, making it labor-intensive and less efficient for large-scale or real-time data replication.

Onehouse

Onehouse delivers cost efficiency by streamlining data ingestion and transformation processes. It can ingest and process operational and Kafka event data across a single bronze and silver layer, eliminating the need for redundant data preparation steps. The platform also employs optimized Onehouse Compute Runtime (OCR) on low-cost Spark compute, further driving down operational expenses. 

Onehouse streamlines the journey to a data lakehouse, reducing deployment times from months to days. It excels in scenarios involving high data volumes, making it a highly cost-effective solution for organizations managing 100GB/month or more.

Figure 1 illustrates the monthly cost comparison of Onehouse, Fivetran, and Airbyte for typical database replication use cases across various data volume scales. 

Figure 1: Monthly cost comparison across a wide range of data volumes

Note: Data points for Airbyte usage are not readily available beyond a scale of 800GB per month. Costs for higher Airbyte data volumes have been calculated based on an assumed rate of $6/GB.

At low data volumes (≤100GB/month), Airbyte emerges as the most cost-effective solution, while Onehouse maintains a flat cost within this range due to the baseline infrastructure required to run an active Kubernetes cluster and support an observability and monitoring stack (Figure 2). 

Figure 2: Zooming in from figure 1, cost comparisons at a low data volume scale 

Use Cases

Fivetran and Airbyte are designed to connect to a wide variety of data sources, supporting a diverse range of use cases. They provide connectors for:

  • Data Sources:
    • Databases
    • SaaS applications
    • Streaming platforms
    • File storage systems
  • Destinations:
    • Data warehouses
    • Databases
    • Data lakehouses

The combination of numerous source-to-destination permutations explains why Fivetran and Airbyte need to maintain and support over 500 connectors. Their focus on breadth enables extensive connectivity but comes with the tradeoff of spreading resources across many low-volume integrations.

In contrast, Onehouse follows a different business model. It focuses exclusively on ingesting data from high-volume sources, not on the development of longtail SaaS connectors. This targeted approach allows Onehouse to specialize in medium to large-scale data ingestion use cases, delivering significant cost and performance advantages over solutions like Fivetran and Airbyte (Figure 1).

Onehouse writes data solely into data lakes, and syncs the data lake tables to data warehouses as external tables. We consider directly writing into warehouses without a data lake in the picture an anti-pattern, as it tightly couples data sources to warehouses and drives up transformation costs in expensive data warehousing engines. Instead, Onehouse’s vision is to build a universal data lakehouse that supports in-place data transformation, processing, and validation.

Fivetran offers a Databricks Connector. Using it, however, requires a running Databricks cluster for writing into Delta Lake. This setup leads to paying both Fivetran and Databricks for basic mutable data lakehouse ingestion, creating higher costs and vendor lock-in.

Onehouse’s approach avoids these pitfalls by focusing on high-performance, cost-efficient, and flexible data lakehouse solutions that support scalable, enterprise-grade workloads without unnecessary complexity or expenses.

While Fivetran and Airbyte have recently introduced support for Iceberg and Delta Lake data lakes as destinations, their solutions remain immature in this area. Both platforms only offer basic table maintenance operations, such as deleting old snapshots, metadata file versions, and orphan files. However, these maintenance tasks require additional resources and can interfere with regular ingestion jobs, leading to increased data latency—for instance, Fivetran schedules maintenance only on Saturdays to minimize sync delays. In contrast, Onehouse provides out-of-the-box support for asynchronous table optimizations, including auto file sizing, clustering, compaction, and cleaning, all designed to run without impacting the performance of ingestion jobs.

Over the past three years, Notion's data volume has expanded 10x, driven by increased user activity and content creation (Figure 3). To handle this growth, they initially horizontally sharded their PostgreSQL database to 480 logical shards across 96 physical instances. For offline analytics, Notion utilized an ELT pipeline with Fivetran to replicate PostgreSQL data into Snowflake. However, managing 480 Fivetran connectors became operationally complex and costly, especially with their update-heavy workload, where 90% of upserts were updates. These challenges led Notion to migrate from Fivetran to an in-house solution using Apache Spark for distributed processing and Apache Hudi for efficient upserts, incremental ingestion, and schema evolution.

This migration significantly optimized Notion’s data pipeline, reducing costs, eliminating connector complexity, and improving performance. The new lakehouse architecture, powered by Spark and Hudi, saved over a million dollars annually by eliminating slow and costly Snowflake loads and dramatically improving efficiency. Historical data syncing, which previously took a week, was reduced to just two hours (an 84x improvement), and incremental syncing now occurs every four hours via Hudi Streamer. This scalable and flexible infrastructure enables efficient data processing without impacting live databases or user-facing performance.

Figure 3: Notion’s data platform journey

Feature Comparison

Unlike Airbyte and Fivetran (see Table 1), which are general-purpose ELT tools, Onehouse is purpose-built for medium to large-scale data ingestion use cases, such as those involving Kafka sources and database Change Data Capture (CDC). Onehouse is tailored to meet the demands of continuous ingestion and real-time data synchronization at scale, offering the following key advantages:

  • Continuous Ingestion: Seamlessly capture and process data in real-time
  • Automatic Optimizations: Ensure data efficiency and performance with minimal manual intervention
  • Low-Code Pipelines: Simplify the creation and management of data workflows
  • Open Storage: Maintain flexibility with support for multiple table formats, query engines, and cloud platforms
Founded
  • Onehouse: 2021
  • Airbyte Cloud: 2020
  • Fivetran: 2012

Specialization
  • Onehouse: End-to-end cloud-native data lakehouse SaaS solution
  • Airbyte Cloud: General-purpose ELT tool focusing on tail connectors
  • Fivetran: General-purpose ELT tool focusing on tail connectors

Cost
  • Onehouse: Compute-based pricing; highly cost-effective at medium to large data volumes
  • Airbyte Cloud: Volume-based pricing; cost increases linearly with data volume and becomes expensive at high scales (>500GB/month)
  • Fivetran: Row-based pricing with MARs (monthly active rows); cost can increase non-linearly at large data volumes

Scale
  • Onehouse: Medium to large data lakehouses of 100GB to 100s of TB per month
  • Airbyte Cloud: Mostly small to medium data ingestion, under a few hundred GB per month (based on the author’s research)
  • Fivetran: Mostly small to medium data ingestion, under a few hundred GB per month (based on the author’s research)

Latency
  • Onehouse: Natively supports Kafka with data latency as low as 30 seconds
  • Airbyte Cloud: Typical data latency is one hour or more
  • Fivetran: Standard data latency is 15 minutes but can be as low as 1 minute

Open Source
  • Onehouse: Managed service based on OSS infrastructure for no lock-in; all data stored in OSS table formats
  • Airbyte Cloud: Self-hosted OSS version available
  • Fivetran: N/A

Sources
  • Onehouse: High-volume sources (databases, event streams, and cloud storage); tail connectors via partners
  • Airbyte Cloud: More than 500 connectors
  • Fivetran: More than 500 connectors

Data Warehouse Sink
  • Onehouse: Indirectly ingests into warehouses as external tables, e.g., Snowflake Iceberg, Redshift Spectrum, and BigQuery BigLake
  • Airbyte Cloud: Directly ingests into warehouses; data is not interoperable with other engines
  • Fivetran: Directly ingests into warehouses; data is not interoperable with other engines

Data Lakehouse Sink
  • Onehouse: Ingests to all table formats, materializes updates from CDC sources, and automates all table maintenance with 30x performance optimizations
  • Airbyte Cloud: Delta Lake and Iceberg (beta); limited table maintenance for Iceberg only
  • Fivetran: Delta Lake and Iceberg connectors; limited Iceberg table maintenance, run only on weekends

Custom Transformation
  • Onehouse: Confluent connectors and custom transformers based on Java and Scala
  • Airbyte Cloud: No-code/low-code, AI-powered Connector Builder
  • Fivetran: Limited, through Fivetran cloud functions

Medallion Architecture
  • Onehouse: Support for end-to-end medallion architecture
  • Airbyte Cloud: JSON object normalization and dbt integration rely on the target database
  • Fivetran: Pre-built data models and dbt integration rely on the target database

Data Optimizations
  • Onehouse: Support for both inline and async table optimizations, including auto file sizing, clustering, compaction, and cleaning
  • Airbyte Cloud: Basic table maintenance operations
  • Fivetran: Basic table maintenance operations

Compliance
  • Onehouse: SOC 2 Types 1 and 2, PCI DSS
  • Airbyte Cloud: SOC 2, ISO 27001
  • Fivetran: SOC 2, ISO 27001, PCI DSS, and HITRUST

API
  • Onehouse: SQL-style REST APIs to manage resources and support a CI/CD deployment model
  • Airbyte Cloud: APIs available through Airbyte Cloud and Airbyte’s open source edition
  • Fivetran: Available through Powered by Fivetran

Table 1: Onehouse vs. Airbyte vs. Fivetran

Conclusion

To select the best tool for your specific analytics use case, we strongly recommend running proofs of concept (POCs) and testing each platform. All three tools offer free trials.

Testing these solutions will provide hands-on experience to help you evaluate their performance, features, and suitability for your specific requirements.

If you're interested in learning more about Onehouse and how Onehouse can help you build a cloud-native data lakehouse in a few days, please contact us at gtm@onehouse.ai or sign up via our listing on the AWS Marketplace today!

Authors
Po Hong
Lead Solutions Architect

Po is a Lead Solutions Architect and Manager at Onehouse, bringing over 20 years of experience in Big Data, enterprise data management, and cloud computing. Since joining Onehouse in July 2022, Po has been instrumental in delivering innovative solutions for customers. Prior to Onehouse, he served as a Principal Data Architect at AWS and spent six years in various roles, focusing on designing and implementing large-scale cloud data warehouses and data lakehouses for numerous AWS customers.
