April 19, 2024

Schema Evolution on the Data Lakehouse

In today’s world of data streaming and complex distributed architectures, data table schemas are constantly evolving. Data is produced by a variety of sources, and may be touched by several teams within a company along the journey from origination to insights. In addition, while modern data architectures provide opportunities for next-level analytics and machine learning, these complex systems introduce new operational challenges. One particularly thorny challenge is handling schema evolution (aka schema drift).

Schema evolution is the case where a database table’s schema changes over time. For example, a new column may be added, or the type of an existing column may be changed. Without proper handling, schema evolution can lead to data pipeline failures, data quality issues, and even data loss.

This article describes how schema evolution can cause problems for data management and shows how Onehouse gracefully handles such changes in the data lakehouse. 

Why Does Schema Evolution Occur?

When working across automated systems and inter-team dependencies, it’s common that the structure of data will evolve over time. For example:

  • A product team adds a new column in their telemetry events to track a user’s time spent on a page 
  • A finance team changes the format of financial codes in their database from a numerical system to an alphanumeric system to align with international standards
  • An eCommerce team adds a new table in their operational database with data about product categorizations

When all of these data sources are ingested into a data warehouse or data lake/lakehouse for analytics, the data engineering team must ensure that schema changes are handled without breaking downstream pipelines, dashboards, and applications. As data sources expand, and as upstream engineering teams make changes without understanding the potential effects on analytics, this quickly becomes a complicated, time-consuming task that hinders operational efficiency.

More Changes, More Problems

Data teams must be prepared to handle schema changes, or they will face painful challenges. We frequently see these common schema-related challenges in data organizations:

Data Pipeline Failures

An upstream engineering team modifies the schema of a table without notifying the team responsible for analytics. An ingestion pipeline unprepared to handle schema evolution will fail upon schema changes, triggering on-call alerts, delaying data for consumers, and potentially causing data loss if the source data expires.

Data Quality Issues

Data quality issues come in all shapes and sizes, but often share a common cause: misaligned expectations between data producers and consumers. For example, a data producer might change the type of a column from an Integer to a more-precise Double; a system not designed to handle schema evolution might attempt to coerce the Double into an Integer, losing the precision of the source data, or even causing errors downstream. Another common case is for data to arrive that’s missing an expected column; when data consumers try to join tables on this column, they will be missing (potentially many) data points.
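
To make the precision problem concrete, here is a tiny, purely illustrative Python snippet (not tied to any particular pipeline) showing what happens when a value widened to a Double is coerced back into an Integer:

```python
# Purely illustrative: a producer widens a column from Integer to Double,
# but a consumer still coerces the value back to an Integer.
source_value = 19.99          # the source now sends a Double
coerced = int(source_value)   # a pipeline still expecting an Integer truncates it
print(coerced)                # prints 19 -- the fractional precision is silently lost
```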

Inconsistent Systems

Data engineers constantly grapple with the challenge of keeping their analytics systems consistent with the source systems. As data moves from sources (operational databases, event streams, etc.) to analytics-ready storage (data warehouse or data lake/lakehouse), pipeline delays and failures - frequently caused by schema evolution - make these systems fall out of sync. When this happens, analytics systems, such as BI dashboards and ML models, are no longer operating on an up-to-date understanding of the world.

How Onehouse Automates Schema Evolution and Management

Onehouse provides a managed service for building and managing your data lakehouse. Onehouse handles schema evolution as data is ingested into the lakehouse, so schema changes in your upstream sources don’t disrupt the delivery of fresh, high-quality data to the lakehouse.

Backward-compatible Schema Evolution

Onehouse detects schema changes as they occur within the ingested data and automatically handles schema evolution in a backward-compatible way. Schema evolution is considered backward-compatible when the changes do not cause the loss of any existing data in the table. Backward compatibility is critical to avoid breaking queries against downstream tables.

For example, dropping a column from the table would not be backward-compatible, because existing records in the table would lose the data from that column. Adding a column, on the other hand, can be handled in a backward-compatible fashion. Similarly, changing the type of a column to a narrower type (e.g., changing a Double to an Int) would not be backward-compatible, because the new type might not be able to store the existing values in the table, whereas the opposite change (widening an Int to a Double) is backward-compatible.
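
To make the distinction concrete, here is a small illustrative Python check for whether a type change is a safe widening. The set of allowed promotions listed below is an example, not Onehouse's exact rules:

```python
# Sketch of a backward-compatibility check for column type changes.
# The set of allowed widenings is illustrative, not Onehouse's exact rules.
ALLOWED_WIDENINGS = {
    ("int", "long"), ("int", "double"), ("float", "double"),
    ("int", "string"), ("long", "string"), ("double", "string"),
}

def is_backward_compatible(old_type: str, new_type: str) -> bool:
    """A type change is backward-compatible if the new type can represent
    every existing value without loss (i.e. it is unchanged or wider)."""
    return old_type == new_type or (old_type, new_type) in ALLOWED_WIDENINGS

print(is_backward_compatible("int", "double"))   # True  -- widening
print(is_backward_compatible("double", "int"))   # False -- narrowing would lose data
```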

When backward-compatible schema evolution occurs in the source, your Onehouse tables are automatically updated to match the latest schema. Even in some cases where the source schema is changed in a non-backward-compatible way, your Onehouse tables are gracefully updated to incorporate the schema changes while maintaining backward-compatibility. Onehouse automatically syncs the updated schemas to all of your data catalogs, so data consumers always see the latest state of the table.

This allows you to flexibly handle changes in upstream systems without your pipelines breaking and requiring manual intervention. You’ll save time repairing pipelines, and your downstream data consumers will be able to operate without data delays. 

Onehouse supports the following schema evolution use cases:

Adding New Columns

When you add a new column in your source, the new column will be added in the Onehouse table. All previous records will have a null value for the new column.
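
Here is a minimal PySpark sketch of what this looks like from the reader's side; the column names are hypothetical, and the union below merely stands in for how the evolved table presents its older records:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("add-column-demo").getOrCreate()

# Records written before the schema change (no seconds_on_page column).
old_batch = spark.createDataFrame([(1, "home")], ["user_id", "page"])
# Records written after the source added the new column.
new_batch = spark.createDataFrame([(2, "checkout", 42)], ["user_id", "page", "seconds_on_page"])

# Reading the evolved table behaves like this union: the older records
# simply show null for the newly added column.
combined = old_batch.unionByName(new_batch, allowMissingColumns=True)
combined.show()  # the pre-existing record has null for seconds_on_page
```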

Removing Columns

When you remove a column in your source, the column will remain in the Onehouse table to maintain backward-compatibility. Existing records will keep their data for the column, while new records contain null values for the column.

A common pitfall when handling removed columns is to drop the column from the lakehouse table as soon as it disappears from the source. This non-backward-compatible change can break downstream queries. A better approach is to maintain a view of the table in the downstream query engines: if a consumer no longer needs the column, they can simply remove it from their view without deleting it from lakehouse storage, preserving the column's pre-existing values.
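
As a rough PySpark sketch of that view-based pattern (the table, view, and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Stand-in for the lakehouse table; names are hypothetical.
spark.createDataFrame(
    [(1, 100, 19.99, "LEGACY")],
    ["order_id", "customer_id", "total_amount", "retired_column"],
).createOrReplaceTempView("orders")

# Downstream consumers query the view. Dropping the retired column for them
# is just a view change; its historical values stay in the underlying table.
spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW orders_current AS
    SELECT order_id, customer_id, total_amount
    FROM orders
""")
spark.sql("SELECT * FROM orders_current").show()
```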

Updating Column Types

Onehouse handles column type updates in a backward-compatible way by using the logical superset of the new and old type. This superset (aka super-type) is the type which can contain all of the column’s values without losing data. For example, the super-type of Int and Long would be Long, while the super-type of Long and String would be String.
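
As an illustration only (not Onehouse's implementation), a small Python sketch of super-type resolution might look like this:

```python
# Sketch of super-type resolution; the promotion table is illustrative and
# not Onehouse's exact implementation.
SUPER_TYPES = {
    frozenset({"int", "long"}): "long",
    frozenset({"int", "double"}): "double",
    frozenset({"long", "string"}): "string",
    frozenset({"double", "string"}): "string",
}

def resolve_super_type(old_type: str, new_type: str) -> str:
    """Return a type that can hold every value of both types without loss."""
    if old_type == new_type:
        return old_type
    # Fall back to string, which can represent any value as text.
    return SUPER_TYPES.get(frozenset({old_type, new_type}), "string")

print(resolve_super_type("int", "long"))     # long
print(resolve_super_type("long", "string"))  # string
```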

Incremental Processing

A critical capability of an efficient data processing tool is the ability to write schema changes incrementally. For example, when a new record from the source adds a column to the schema, Onehouse writes only that new record, avoiding expensive full-table rewrites. Other ingestion and storage systems may rewrite the entire table when the schema is changed, sticking you with the compute bills for the large rewriting operation.

Data Quality Quarantine

In some cases, you might not want to evolve a schema, or your data sources may have schema changes that are not backward-compatible. Onehouse helps you handle these cases gracefully by quarantining data that does not meet your pre-defined rules.

When configuring an ingestion job in Onehouse, you can add data quality validations, such as enforcing a specific schema or requiring a column to contain only values within a defined range. When an incoming record fails to meet these validations, Onehouse writes it to a separate quarantine table within your storage bucket. This ensures that your pipeline runs uninterrupted with only high-quality data, and enables you to discover data quality issues and resolve them from the source.
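
As a rough sketch of the pattern (not Onehouse's actual configuration, which is set declaratively on the ingestion job), here is how such a validation-and-quarantine split might look in PySpark; the rule, column names, and table paths are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").getOrCreate()

events = spark.createDataFrame(
    [(1, 35), (2, -4), (3, 120)],
    ["event_id", "seconds_on_page"],
)

# Hypothetical validation: seconds_on_page must fall within a sane range.
rule = (F.col("seconds_on_page") >= 0) & (F.col("seconds_on_page") <= 86400)

valid = events.filter(rule)
quarantined = events.filter(~rule)

# Good records flow to the main table; failing records land in a separate
# quarantine table so the pipeline keeps running and issues can be reviewed.
valid.write.mode("append").parquet("/tmp/lakehouse/page_events")
quarantined.write.mode("append").parquet("/tmp/lakehouse/page_events_quarantine")
```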

Auto-Detect New Tables and Topics

Similar to modifying the schema of an existing table, upstream data producers may create new tables in an operational database or new topics in an event stream. Onehouse can automatically detect these new entities and materialize them as analytics-ready tables in your data lakehouse. This improves the visibility of data across your organization and saves you the time and effort of manually configuring new pipelines when new source topics or tables are created.

The “Shift-Left” Movement and the Universal Data Lakehouse

In the data world, “shifting left” means moving your data quality checks and management further upstream in your pipeline. This enables you to catch and resolve data quality issues as they occur, preventing painful downstream outages. Onehouse helps you shift your data quality checks and management to the left by handling schema evolution as data is ingested into the data lakehouse.

Onehouse is built around the Universal Data Lakehouse architecture, which simplifies management and operations by providing a universal storage layer to power all of your organization’s analytics. All of your downstream consumers benefit when you take care of schema evolution in this universal layer.

Conclusion

Schema evolution poses significant challenges in modern data architectures, often leading to data pipeline failures, data quality issues, inconsistent data, and even data loss. For teams ingesting data from various sources into a unified data warehouse or lake/lakehouse, handling schema evolution becomes a complex and time-consuming task.

It’s important to build, or buy into, a data ingestion platform that can gracefully handle schema evolution. Onehouse addresses these challenges head-on with its automated solutions such as schema evolution, data quality quarantine, and automatic data detection. These capabilities save data organizations time and resources while enhancing the reliability and accuracy of analytics.

Let us know how you’re solving schema evolution in the comments, and reach out to gtm@onehouse.ai if you’d like to learn more about what Onehouse can offer.
