Data quality refers to how accurate, consistent, complete, and reliable data is. Maintaining it is essential in data engineering so that the data used for analysis and decision-making can be trusted. Organizations use techniques such as data validation, cleansing, profiling, and monitoring to identify and resolve issues, thereby preserving data integrity and achieving high-quality data.
In this blog, we will explore how Apache Hudi's pre-commit validation feature can be leveraged to ensure data quality by validating data before it is committed to the storage system.
Importance of data quality in a data lakehouse
Apache Hudi, originally developed at Uber in 2016 and open sourced in 2017, is a cutting-edge data lakehouse technology that is now a top-level Apache Software Foundation project. It has gained widespread adoption and contributions from major enterprises like Uber, Amazon, Walmart, GE, and others. Hudi brings database-like features to data lakes, paving the way for the data lakehouse paradigm. With Hudi, users can perform updates and deletions on their data lakes with transactional consistency and high performance, revolutionizing data lake management.
Apache Hudi provides pre-commit validators that allow users to validate their data against specific data quality expectations during the writing process using DeltaStreamer or Spark Datasource writers. These validators ensure the integrity and quality of the data being written.
To configure these validators, users can use the hoodie.precommit.validators setting, which accepts a comma-separated list of validator class names. This configuration provides flexibility for customizing the data quality validation process according to specific requirements.
The pre-commit validators in Apache Hudi enable users to enforce data quality checks such as uniqueness constraints, schema compliance, data consistency, and adherence to business rules. Because these checks run before the data is committed as a new table version, only data that passes them is written to the storage system.
By leveraging this feature, users can choose from a range of built-in validators or create their own custom validators to meet their specific data quality needs. This ensures that the written data meets the desired quality standards and enhances the reliability and accuracy of the overall data lake or data warehousing system. For example, if a user wants to apply multiple validators, they can configure them as follows:
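A minimal sketch of such a configuration with the Spark Datasource writer might look like the following; ValidatorClass1 and ValidatorClass2 are placeholders for fully qualified validator class names, and the table name, fields, and path are assumptions made for illustration:

```scala
// Sketch: chaining two pre-commit validators on a Spark Datasource write.
// ValidatorClass1 / ValidatorClass2 are placeholders; assumes a spark-shell
// session with the Hudi bundle on the classpath.
import spark.implicits._

val df = Seq((1, "alice", 1000L, "2023-06-01")).toDF("id", "name", "ts", "dt")

df.write.format("hudi").
  option("hoodie.table.name", "my_table").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "dt").
  // Comma-separated list of validator classes to run before the commit completes.
  option("hoodie.precommit.validators",
    "com.example.ValidatorClass1,com.example.ValidatorClass2").
  mode("append").
  save("/tmp/hudi/my_table")
```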
In the above example, two validator classes, ValidatorClass1 and ValidatorClass2, are specified and will be executed during the pre-commit phase.
Apache Hudi ships with a built-in validator, SqlQuerySingleResultPreCommitValidator, which validates that a SQL query against the table produces a specific single-value result. The queries are supplied through the hoodie.precommit.validators.single.value.sql.queries property: multiple queries can be provided, separated by a ';' delimiter, and the expected result is included as part of each query, separated by '#'. This lets users verify specific conditions or expectations on the data. Make sure the individual validation queries are not terminated by a semicolon, since the semicolon serves as the separator between queries.
The following example demonstrates how to implement a validation that restricts the insertion of null values for the "name" column using a Spark SQL validation query:
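Below is a minimal sketch of this check using the Spark Datasource writer; the users table, its schema, and its path are assumptions made for illustration. The validation query is passed through the hoodie.precommit.validators.single.value.sql.queries property, and <TABLE_NAME> is a placeholder that Hudi substitutes with a view over the table's state after the write.

```scala
// A minimal sketch, assuming a spark-shell session with the Hudi bundle on the
// classpath; the table name, schema, and path are illustrative assumptions.
import org.apache.spark.sql.DataFrame
import spark.implicits._

val tablePath = "/tmp/hudi/users"

def writeUsers(df: DataFrame): Unit = {
  df.write.format("hudi").
    option("hoodie.table.name", "users").
    option("hoodie.datasource.write.recordkey.field", "id").
    option("hoodie.datasource.write.precombine.field", "ts").
    option("hoodie.datasource.write.partitionpath.field", "dt").
    // Run the single-result validator before every commit.
    option("hoodie.precommit.validators",
      "org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator").
    // <TABLE_NAME> is substituted by Hudi with a view of the table's post-write
    // state; the expected result (0) follows the '#' separator.
    option("hoodie.precommit.validators.single.value.sql.queries",
      "select count(*) from <TABLE_NAME> where name is null#0").
    mode("append").
    save(tablePath)
}

// Query 1: contains a null name, so pre-commit validation is expected to fail
// and the commit is aborted.
writeUsers(Seq((1, null.asInstanceOf[String], 1000L, "2023-06-01")).toDF("id", "name", "ts", "dt"))

// Query 2: no null names, so the write is expected to succeed.
writeUsers(Seq((2, "alice", 1001L, "2023-06-01")).toDF("id", "name", "ts", "dt"))
```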
In the provided code block, "Query 1" fails with the error message "At least one pre-commit validation failed." This happens because the validator query counts null values in the "name" column and returns 1, which does not match the expected result of 0.
The error indicates that the validation condition is not met, and the insertion of null values in the "name" column is restricted as intended.
Here's another, more comprehensive example that enforces multiple validation rules in a single configuration.
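The rule set below is an illustrative assumption rather than a definitive recipe: suppose a hypothetical products table must satisfy two rules, (1) the name column is never null and (2) the price column is never negative. Each rule becomes its own validation query, and the queries are joined with ';' in the same property (note that no individual query ends with a semicolon):

```scala
// Illustrative sketch with assumed rules: (1) no null names, (2) no negative prices.
// Assumes a spark-shell session with the Hudi bundle (spark.implicits._ in scope).
import spark.implicits._

val batch = Seq(
  (1, "apple", 1.50, 1000L, "2023-06-01"),
  (2, "banana", -0.99, 1001L, "2023-06-01") // violates rule 2, so the commit should fail
).toDF("id", "name", "price", "ts", "dt")

batch.write.format("hudi").
  option("hoodie.table.name", "products").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "dt").
  option("hoodie.precommit.validators",
    "org.apache.hudi.client.validator.SqlQuerySingleResultPreCommitValidator").
  // Two validation queries separated by ';', each with its expected result after '#'.
  option("hoodie.precommit.validators.single.value.sql.queries",
    "select count(*) from <TABLE_NAME> where name is null#0;" +
    "select count(*) from <TABLE_NAME> where price < 0#0").
  mode("append").
  save("/tmp/hudi/products")
```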
Apache Hudi offers the SparkPreCommitValidator class, which users can extend to define custom pre-validators based on their specific use cases. By extending this class, users can create their own validation logic to ensure data quality before the commit operation.
The SparkPreCommitValidator class serves as an abstract class that users can subclass to implement their custom pre-validators. The key method to override is validateRecordsBeforeAndAfter, which takes two Dataset<Row> parameters representing the data before and after the commit, and a Set<String> parameter indicating the affected partitions.
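As a sketch, the hypothetical NonNullNameValidator below rejects a commit if the table would contain null values in a name column after the write. The overridden method follows the description above; the constructor arguments and generic type parameters are assumptions that may need adjusting to the Hudi version in use:

```scala
package com.example

import java.util.{Set => JSet}

import org.apache.hudi.client.WriteStatus
import org.apache.hudi.client.validator.SparkPreCommitValidator
import org.apache.hudi.common.data.HoodieData
import org.apache.hudi.common.engine.HoodieEngineContext
import org.apache.hudi.common.model.HoodieRecordPayload
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.exception.HoodieValidationException
import org.apache.hudi.table.HoodieSparkTable
import org.apache.spark.sql.{Dataset, Row}

// Hypothetical custom validator: fails the commit if the post-write state of the
// table contains any null values in the "name" column. The type parameters and
// constructor mirror the parent class and may differ slightly across Hudi versions.
class NonNullNameValidator[T <: HoodieRecordPayload[T], I, K, O <: HoodieData[WriteStatus]](
    table: HoodieSparkTable[T],
    engineContext: HoodieEngineContext,
    config: HoodieWriteConfig)
  extends SparkPreCommitValidator[T, I, K, O](table, engineContext, config) {

  override protected def validateRecordsBeforeAndAfter(
      before: Dataset[Row],
      after: Dataset[Row],
      partitionsAffected: JSet[String]): Unit = {
    // `after` reflects the table as it would look if this commit succeeds.
    val nullNames = after.filter(after.col("name").isNull).count()
    if (nullNames > 0) {
      throw new HoodieValidationException(
        s"Found $nullNames record(s) with a null 'name' column; rejecting commit")
    }
  }
}
```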
With the ability to extend the SparkPreCommitValidator class, users have the flexibility to incorporate their own data quality checks, apply custom business rules, and implement specific validation conditions. This empowers users to create pre-validators that align with their unique requirements.
To use a custom pre-validator, users simply reference its fully qualified class name in the hoodie.precommit.validators property, just as with the built-in validators.
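Continuing the sketch, the hypothetical com.example.NonNullNameValidator is referenced by its fully qualified class name, exactly like a built-in validator, and it can be combined with built-in validators in the same comma-separated list:

```scala
// Wire in the custom validator from the sketch above (class name is hypothetical).
// Assumes the same users table and spark-shell setup as the earlier example.
import spark.implicits._

val df = Seq((3, "bob", 1002L, "2023-06-01")).toDF("id", "name", "ts", "dt")

df.write.format("hudi").
  option("hoodie.table.name", "users").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "dt").
  option("hoodie.precommit.validators", "com.example.NonNullNameValidator").
  mode("append").
  save("/tmp/hudi/users")
```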
By leveraging the SparkPreCommitValidator class, Apache Hudi empowers users to define custom pre-validators that enable comprehensive data quality validation. These tailored validators ensure that the validation process aligns precisely with users' specific use cases and quality expectations.
By configuring and utilizing these pre-commit validators in Apache Hudi, users can enforce data quality expectations and ensure the integrity, consistency, and correctness of the data being written to their data lakehouses.
These pre-commit validators provide users with powerful tools to validate their data and ensure it meets the desired quality standards, enhancing the reliability and accuracy of their data-driven applications and analytics processes.
We hope that this blog helps users implement pre-commit validators for data quality checks in their lakehouse data pipelines. We value your feedback and encourage you to share your thoughts with us. Feel free to engage with us in the Hudi community and join our Slack channel!