December 10, 2024

Comprehensive Data Catalog Comparison


Intro

The race to build the most comprehensive data catalog is accelerating as vendors compete to become the central hub for an organization’s data stack. With so many options on the market, it can be challenging to evaluate and choose the right catalog for your environment. At Onehouse, we don’t build or offer a catalog; instead, we feature a multi-catalog synchronization utility. We have many users who use multiple catalogs, and others who ask us for advice and recommendations on what they should use. I recently hosted a panel discussion with some of the industry’s top minds building open source data catalogs: Unity Catalog, Apache Polaris (Incubating), DataHub, and Apache Gravitino.

(https://opensourcedatasummit.com/data-catalogs-panel/)

As I started to do my own research, I found many great blogs and resources, but I quickly realized that there was no neutral, comprehensive source that explained in depth what a catalog is and compared features across offerings.

Before you get deep into the read, I want to call out a few things. First, the definition of a catalog is very broad, the use cases are diverse, and the needs of different organizations can be unique. The engineering communities I mostly engage with are focused on data engineering around data lakes and the data lakehouse architecture. So rather than trying to create a comparison that evaluates catalogs from multiple user points of view, I will take the point of view of an organization building a data platform where a data lake is a key component, serving as the central store for most of its data. Lastly, as any good comparison article should call out, do your own research… I reference my research throughout, but your goals, needs, and architecture are unique; there is no single “winner” for everyone.

Since this blog is lengthy, use these anchor links to skip to the section that matters to you:

What is a Catalog

Table Formats and Catalogs

How to Choose a Catalog

Feature Comparison Ratings

When to use Which Catalog

What is a data catalog?

Basic description

In the most general sense, a data catalog is an organized inventory of data assets within an organization. It helps users discover, understand, and manage the data available for use. A data catalog typically includes metadata (information about the data) such as data sources, descriptions, owners, quality metrics, lineage, and access controls. It acts like a searchable directory for datasets, similar to how a library catalog organizes books.

(source: Atlan)

Use cases

Data Discovery and Exploration

If you have data, no matter if your company is large or small, even if you are the only data engineer in your company, you have a data discovery problem. Data catalogs play a key role in keeping track of and surfacing awareness of what data assets are available. Catalogs have a broad spectrum of effectiveness on this use case. In the most rudimentary form (more like a metastore), a catalog may simply be a place that you store a table name, schema, and reference to where it can be found. Others will have advanced crawlers and AI features exposing advanced search or proactive recommendations of datasets. Examples of features that matter for data discovery and exploration:

  1. Crawling to find all assets - how can a catalog gather knowledge of all data?
  2. Browse/Discover - how easily can users browse and discover what they need?
  3. Search - how easily can users intentionally search for specific data?
  4. Business Glossary - what additional metadata is available including tags, documentation, statistics, etc?
  5. Classification of data - can I classify data with certain labels for compliance/security?

Data Governance - Access Control

While data is a catalyst for innovation, it is also a potential liability that needs to be secured and safely governed. Since a data catalog has a comprehensive inventory of data across your estate, many catalogs also offer robust ways to author and maintain access control policies. Good access control needs to strike a delicate balance: rigorous control, while still providing a smooth experience, free of delays and friction, for those granted permission to access data. Examples of features that matter for data governance and access control:

  1. Policy authoring - how can a user express who has access to what data?
  2. Authentication and Authorization - how is a user identity validated, and how are credentials and secrets managed?
  3. Access enforcement with query engines - Is access to the root data in storage secured and governed by the catalog? When a query engine wants to access data, can it go around the catalog and read it from storage anyway, or does it have to go through the catalog to be granted access at query time?

Data Governance - Compliance

In addition to standard access control, organizations are challenged to meet global compliance certifications such as SOC II, PCI/DSS, HITRUST, FedRAMP, etc, which require deeper knowledge about what data exists in what data stores. A key component is maintaining a data inventory which classifies data based on sensitivity and regulatory requirements, such as personal, confidential, or public categories. This classification helps organizations apply appropriate protections, such as retention policies, encryption for sensitive data or stricter access controls for personally identifiable information (PII). Regular audits and assessments ensure the inventory and classifications remain accurate and align with laws like GDPR, CCPA, or HIPAA. Examples of features that matter for compliance include:

  1. Data classification - can appropriate labels be placed on data to ensure it is handled safely according to internal policies?
  2. Retention policies - can automatic expiry, deletion, or prevention of the deletion of data be set to meet compliance standards?
  3. Auditing - is there logging of all activities kept for regulatory audits?

Data Lineage and Documentation

Do you remember the last time you were troubleshooting a data quality problem on your table? Maybe some unexpected data came in, but you don’t know where from? Data lineage provides a detailed map of how data flows through an organization—tracking its origin, transformations, and destinations. This visibility provides the context for identifying errors and ownership of data for accountability. Some data catalogs even support the documentation of or definition of metrics and aggregations of data. Comprehensive documentation complements lineage by recording the methods, tools, and standards used to process and analyze data, making it easier for teams to collaborate and troubleshoot. Examples of features that matter for lineage:

  1. Data Lineage - how well can you track the sources and dependencies of your data?
  2. Data ownership and accountability - is data ownership clear?
  3. Metric definitions - can metrics be defined and tracked?

Data Quality Management

Data quality management ensures that data is accurate, reliable, and fit for purpose. Some key features of data quality systems include:

  1. Schema Management - where engineers define and enforce structures for datasets, such as data types, constraints, and relationships, to maintain consistency and prevent errors during data integration or analysis. 
  2. Data Quality Policies - standards and rules for data accuracy, completeness, and validity, ensuring alignment with business and regulatory requirements. 
  3. Monitoring/Alerting - to maintain real-time reliability, engineers implement monitoring and alerting systems that continuously track metrics like error rates, anomalies, and missing values, notifying teams immediately when issues arise. 
  4. Data Freshness - ensuring that datasets are up-to-date and reflect the most current information.

The breadth of features and use cases for data catalogs is wide and diverse. If there are other categories of features you want to see analyzed, drop me a message. I would love to add them: kyle@onehouse.ai

Categories of catalogs

With so many catalogs on the market it is easy to get confused between different options. If you are new to this space, let me try to break down catalogs into a few high level categories:

Metastores

What they’re for: We will discuss metastores in much greater detail below, but the summary is that metastores focus on storing metadata about tables, schemas, partitions, and statistics, primarily for structured data in data lakes. Metastores are used more as functional catalogs for query engines to reference than as a place for the business to discover, learn about, and govern data. In some cases, such as with Apache Hive and Apache Iceberg, metastores can also serve the purpose of transaction coordination.

Tradeoffs: Not great for data discovery, documentation, lineage, etc. Limited connections; usually just for data lake architectures.

Business Catalogs

What they’re for: Maybe there is a better name for this category, but business catalogs take a more user-centric approach, offering business-friendly metadata and contextual information about data assets, such as definitions, owners, and usage guidelines, to help non-technical stakeholders discover and understand data.

Tradeoffs: These catalogs usually have a much larger network of connectors, and their UI applications far outshine any others on the market. However, while they have a more comprehensive view of all data, they typically only enforce access controls on metadata rather than the root data itself.

Catalog of Catalogs

What they’re for: Many organizations have multiple catalogs. Some catalogs can plug into multiple other catalogs to gather all of the metadata across them into a unified view. Catalogs in this category usually operate as either “aggregators” - bringing all data into one place for monitoring/control - or “delegators” - offering a unified view of all data to delegate operations or governance across each individual system. An example of an aggregator is DataHub, and an example of a delegator is Gravitino.

Tradeoffs: Being a layer above multiple catalogs, they will not have the same expressive feature depth as any of the individual catalogs. Depending on your team organization, this approach may be wonderful or challenging to harmonize.

When going through all of the catalogs, I found it hard to keep the categorizations mutually exclusive. Most catalogs of catalogs serve as either a metastore or a business catalog, so below I separated some examples across just those lines to give you some idea:

If you want to unpack all the layers between metastores, catalogs, catalogs of catalogs, semantic layers, etc., you should definitely tune in to this conversation, “Taming the Chaos: A Deep Dive into Table Formats and Catalogs”, where Shirshanka (Acryl) and Vinoth (Onehouse) do a fun live exercise drawing up these lines and talking about the future development of products in this space.

Catalog vs Metastore

A prevalent misunderstanding I see across the community is the difference between a “data catalog” and a “metastore”. A metastore is a much narrower service that is primarily focused on the governance of metadata. In the world of data lakes or the data lakehouse, a metastore is highly recommended, if not required. Data lakes by nature are a random collection of files. Metastores serve as a way to define and organize essential characteristics of your data, including schemas, table and database definitions, and even other constructs or statistics that make it faster to access data, like partitions. When query engines come to access or analyze data stored in a data lake, they usually communicate through a metastore. The following diagram shows a generalization:

It’s hard to say the word metastore without immediately thinking about the Hive Metastore (HMS). It’s also hard to hear the word Hive Metastore without immediately thinking about complaints and grumbles. The Hive Metastore was initially released in 2010 as part of the Apache Hive project developed by Facebook to facilitate data warehousing and SQL-like query capabilities on top of Hadoop. The following diagram shows a high level overview of the internals of HMS (note this is sourced from a good article in 2016 and many features have been added since):

While there are plenty of good reasons to complain about HMS, the fact of the matter is that the Hive Metastore still powers the majority of data lakes in production today. Your Starbucks coffee is made in some part thanks to HMS, your trades on Robinhood are facilitated with HMS, your Amazon packages are delivered because HMS has your back, you can work from home more comfortably with Notion thanks to HMS, and the list goes on and on. It is also important to note the difference between the Hive Metastore, the Hive query engine, and the Hive table format. Many people throw around the word Hive with broad generalizations. AWS Glue is a Hive Metastore, but that does not mean you are using Hive tables or the Hive query engine, or even Hive-style partitioning. When you created a Databricks workspace (before Unity Catalog in 2022), by default it deployed with a Databricks-governed Hive Metastore. The majority of Iceberg deployments today use the Hive Metastore in combination with table metadata stored on cloud storage, and they can still leverage features like Iceberg partition evolution. I am making this distinction about HMS because, while the Hive table format and the Hive query engine are very much dinosaurs with little remaining usage, the Hive Metastore is unfortunately still almost unavoidable today.

(unfortunate reality of the HMS?)
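To ground what a metastore actually does for an engine, here is a minimal PySpark sketch (the thrift endpoint, database, and table names are placeholders) showing a query engine resolving tables through a Hive Metastore rather than by listing files in storage:

```python
from pyspark.sql import SparkSession

# Point the engine at an existing Hive Metastore (placeholder thrift endpoint).
spark = (
    SparkSession.builder
    .appName("hms-example")
    .config("hive.metastore.uris", "thrift://metastore.internal:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Table names, schemas, and partition locations come from the metastore,
# not from scanning the data lake directly.
spark.sql("SHOW TABLES IN sales").show()
spark.sql("SELECT count(*) FROM sales.orders WHERE ds = '2024-12-01'").show()
```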

Relationship of Iceberg, Hudi, Delta and Catalogs

When discussing metastores and catalogs from the point of view of data lakes, a key point that cannot be ignored is the relationship between catalogs and open table formats (OTFs) like Apache Iceberg, Apache Hudi, and Delta Lake. First off, with the common acronym “OTF” I have seen used recently, don’t confuse what a file “format” is (think Parquet, Avro, ORC, Proto, JSON) with what a table format is. A more accurate characterization is perhaps an open table “metadata” format: Iceberg, Hudi, and Delta are metadata abstractions on top of mostly Parquet files. For an in-depth introduction to what these projects are and how they are similar and different, please reference these materials outside of this blog:

Delta Lake and Apache Hudi historically have not had a hard dependency on a metastore or catalog. Delta Lake is soon introducing its first requirement for a catalog in its implementation of multi-table transaction support; watch this presentation from Databricks about the new “Coordinated Commits” feature available in preview in Delta Lake 4.0. Hudi is taking a different approach to this design and will likely continue to avoid a hard dependency on a catalog. Apache Iceberg, however, chose early on to take a strong dependency on a data catalog. There are pros and cons to this approach; it is not a wrong or bad design, just a community preference. This makes it important, however, to double-click on Iceberg and catalogs to understand the current state of things like the Iceberg REST catalog APIs and now the Apache Polaris (Incubating) catalog.

The first question to understand is: what does Iceberg rely on the catalog for? Alex Merced has a great intro here that calls out the need to maintain a list of the existing Iceberg tables and keep a reference to the "current" metadata.json file. Lisa Cao also has an excellent read here that touches on one of the more important points related to ACID compliance and transaction coordination. For the nerdy readers who want to dive in, I recommend Jack VanLightly’s in-depth analysis of Iceberg's consistency model. This is one of the reasons why Snowflake faces tough tradeoffs between internal and external catalogs when using Iceberg.

With the strong dependency on a catalog for transactional safety, Iceberg has good integrations with several catalog implementations, including Hive Metastore, JDBC, and Nessie. With the release of Apache Iceberg version 0.14.0 in July 2022, the Iceberg project introduced the Iceberg REST Catalog API. Following the principle of the Hive Thrift service, the REST catalog API introduced a really great abstraction that allows you to use a single client to talk to any catalog implementation backend. This increased flexibility makes it easier for vendors from a diverse ecosystem of catalogs and query engines to build compatibility with Iceberg. The important point to understand here is that the Iceberg REST API is the “specification” of how to communicate with a catalog, but you still need a catalog “implementation” like Glue, Gravitino, Unity, Polaris, etc. After the Tabular acquisition and their upcoming product deprecation, there is now a mini race among vendors to fill the void and become the industry’s best Iceberg catalog implementation on top of the new REST API specification.

Example architecture from Datastrato
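To illustrate the “one client, any backend” idea, here is a hedged PySpark sketch registering an Iceberg REST catalog; the URI, credential, and warehouse values are placeholders, and the same configuration shape applies whether the server behind the endpoint is Polaris, Gravitino, Unity, or another REST-compatible implementation:

```python
from pyspark.sql import SparkSession

# Assumes the iceberg-spark-runtime package is on the classpath (e.g. via --packages).
spark = (
    SparkSession.builder
    .appName("iceberg-rest-example")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "rest")
    .config("spark.sql.catalog.lake.uri", "https://catalog.example.com/api/catalog")
    .config("spark.sql.catalog.lake.credential", "<client-id>:<client-secret>")
    .config("spark.sql.catalog.lake.warehouse", "analytics")
    .getOrCreate()
)

# The engine only speaks the REST API; the catalog implementation behind it is swappable.
spark.sql("SELECT * FROM lake.marketing.events LIMIT 10").show()
```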

How to choose a catalog?

Are you ready to dive deep into a detailed comparison matrix between catalogs? Before we get there, I want to make it clear that I don’t think there is one single “winner” or best catalog for every scenario. My goal is that my research here gives you a reference for both how to make your own evaluation and serves as a starting point of material for further reading. Please do your own research and make a choice tailored to your unique use case, data architecture, and goals. If you want to discuss these details live, I enjoy casual conversations about this topic, so reach out to me.

To evaluate which data catalog is right for you, I recommend you consider the following dimensions in your decision making:

  1. Do the features meet your use case and medium term goals?
    1. If your minimum needs are not met, it doesn’t matter how cool or how cheap the product is; don’t bother.
    2. The interesting comparison comes in the “nice to have” features. You have to evaluate the ROI of those features relative to the other factors below: cost, complexity, etc.
  2. Community
    1. A vibrant community means that you will have support and content to leverage for help. When it is a community that shares similar goals or builds similar data architectures as yourself it means that your evolving needs are also more likely to be met.
    2. I separate “community” from “open source” below because it is possible to build a strong community even without an open source project.
  3. Open Source
    1. If a catalog is open source, you can contribute, you can fork and customize, and there is likely a community of other developers who you can network with and discuss issues with.
    2. When it comes to open source and catalogs I want to make sure there is a clear distinction between open source implementation and open source APIs as I believe this is a source of confusion. Let me use a few examples:
      1. The Apache Iceberg REST Catalog is a specification of open source APIs; the Apache Polaris Catalog is an open source server that implements those open APIs.
      2. The Snowflake Open Data Catalog is a closed source server and implementation of the open source Iceberg APIs.
      3. Unity Catalog is a bit confusing here. Up until earlier this year, Unity Catalog was only a Databricks product. This summer they created the Unity Catalog open source project and donated it to the Linux Foundation. The Unity APIs were made open source, but the catalog implementation seems forked. Unity Catalog implementation in OSS is not equivalent to the implementation within the Databricks product and I put them side-by-side in my comparison for you to see the differences.
    3. Remember… not all open source is created equal. The key things to consider when evaluating the health of open source includes:
      1. Diversity of contributions - is it “open source”, but only one company runs the show? Or are there engineers from multiple companies and countries contributing?
      2. Project governance - who “owns” the project, and what safeguards are in place to prevent the project from being taken over or changed without notice or community will? Remember the story from this year of the Benthos acquisition by Redpanda? The Apache Software Foundation is the gold standard for strong open source governance. I had first-hand experience earlier this year participating in the donation of Apache XTable (Incubating).
      3. Licensing - read materials outside this blog for more details, but pay attention to licensing as it may restrict you from using the software in commercial applications or restrict you from making modifications.
      4. Enterprise support - everyone loves open source because it is free, but sometimes you need a company with expertise who will back you up. Consider if this is something that matters to you.
  4. Ecosystem
    1. Data Catalogs by nature need to cover an extremely large number of data sources, data types, and connectors. Missing support for your data sources is usually a deal breaker.
  5. Cost
    1. Cost of data catalogs can vary wildly depending on the use cases you are implementing. Pay close attention to the pricing models of different vendors.
    2. Remember open source does not equal free, you will have infrastructure to deploy and maintain.
  6. Time / Complexity
    1. How much effort is involved to onboard, deploy, and, more importantly, maintain and operate the data catalog? Does it need central owners or distributed ownership? What level of automation exists to keep the catalog up to date, versus internal processes you need to develop?

Comparisons

With the decision criteria above in mind, let's compare the catalogs on the market today. There are probably more than 50 credible catalogs on the market and unfortunately I do not have infinite time to compare them all, so here are the ones I selected:

  1. Unity Catalog OSS - (v 0.2.0)
  2. Unity Catalog Databricks Product (DBR 16.0)
  3. Polaris - (v 0.9.0-rc1)
  4. DataHub (v 0.14.1)
  5. Glue (v4.0)
  6. Gravitino (v0.7.0)
  7. Atlan (v 4.7.49)

In this blog I divide features into categories, describe what each feature should do, and then rate each catalog from A to F on how well it delivers on that feature. Each feature is unique, and I roughly give the letters as follows: A = best solution, B = great solution, C = barely there, D = incomplete solution, F = missing functionality. After the 1-by-1 analysis I also created a metascore, awarding 4 points for every A, 3 for a B, 2 for a C, 1 for a D, and 0 for an F. The metascore is the sum of all points across all features and categories:
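For transparency, the metascore involves no weighting; it is a straight sum of grade points. A small Python sketch with made-up grades shows the arithmetic:

```python
# Points awarded per letter grade for the metascore.
POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

def metascore(grades):
    """Sum grade points across every rated feature for one catalog."""
    return sum(POINTS[g] for g in grades)

# Hypothetical catalog rated across six features.
print(metascore(["A", "B", "C", "F", "A", "D"]))  # 4 + 3 + 2 + 0 + 4 + 1 = 14
```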

Is this a perfect, scientifically measured score? No… for example, perhaps not all features should be given equal weights. Is it possible to create an objectively perfect score? I don’t think so… these are not products where you can run some queries, count the seconds, and chalk up a TPC-DS result. As mentioned before, the features that matter to you might not matter to another person. So take my analysis as a starting point for your own research. A few of the ratings below could be interpreted as subjective; if you disagree with a rating, reach out to me and I would love to learn from your perspective and perhaps adapt my conclusions.

(What I look like after doing this research…)

Features

Data Discovery and Exploration

Data Catalog Comparison

| Catalog | Crawling (how easy is it to register all my data in the catalog?) | Data details (how much do I learn about the data in the catalog or glossary?) | Search / Discovery (how easy is it to find what I’m looking for?) |
| --- | --- | --- | --- |
| Unity Catalog OSS | F - Manual table creation w/ schema definition | C - Schema and basic metadata | C - Basic catalog/schema/table tree with basic keyword search |
| Unity Catalog (Databricks) | C - Only tables created in Databricks are automatically added, not external data | A - Classifications, schema, stats, lineage, tags, docs, usage insights, etc. | A - Navigational and intelligent search that interprets schema, docs, stats |
| Apache Polaris | F - Manual table creation | C - Schema and basic metadata | C - Basic catalog/namespace/table tree with basic keyword search |
| DataHub | A - Push and pull methods to crawl and register data | A - Classifications, schema, statistics, lineage, tags, docs, DQ assertions, contracts, history, and more | A - Advanced search with complex filters; good browsing experience for discovery |
| AWS Glue | A - Glue crawlers | C - Schema, partitions, indexes, and basic metadata | C - Basic catalog/db/table tree with basic keyword search |
| Apache Gravitino | C - Can load existing tables in a catalog you connect | C - Schema and basic metadata | C - Basic catalog/schema/table tree with basic keyword search |
| Atlan | A - Crawlers for sources | A - Schema, lineage, stats, classifications, owners, tags, docs, history | A - Advanced search with complex filters; best browsing experience for easy personalized discovery |

Crawling

Some catalogs, like Unity and Polaris, require users to manually register tables and define the schema by hand. Others, like Glue, DataHub, and Atlan, offer crawlers that will automatically traverse your data and register it in your catalog.
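As one concrete example of the crawler approach, here is a hedged boto3 sketch (the IAM role, bucket path, and names are placeholders) that registers an AWS Glue crawler over an S3 prefix so tables and schemas get discovered automatically instead of being declared by hand:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Placeholder role and S3 path -- the crawler infers schemas and registers
# tables in the Glue Data Catalog on each run.
glue.create_crawler(
    Name="lake-orders-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/orders/"}]},
    Schedule="cron(0 2 * * ? *)",  # optional: re-crawl nightly
)
glue.start_crawler(Name="lake-orders-crawler")
```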

Data Details

Atlan and DataHub have undeniably beautiful experiences for learning the intimate details of your datasets, with rich metadata, statistics, documentation, tags, business context, and more. Others simply show the table schema and basic metadata, like when the table was created.

Search/Discovery

Atlan and DataHub are again in a class of their own compared with the other catalogs in my comparison scope. I enjoyed Atlan’s search the most, as they have personalization features for the user.

Data Connectors and Objects Covered

Data Catalog Comparison

| Catalog | Open Table Formats (which are supported?) | File types and unstructured data (which are supported?) | Connectors (breadth of ecosystem) | Notable connector gaps (anything key missing?) |
| --- | --- | --- | --- | --- |
| Unity Catalog OSS | A - Delta, Hudi, Iceberg | A - Parquet, ORC, Avro, CSV, JSON, or TEXT; Volumes for unstructured and semi-structured | C - Sources = data lakes; engines = 5 | Data lakes only; no Flink or Trino |
| Unity Catalog (Databricks) | C - Delta, Iceberg (OSS UniForm and Unity can read Hudi, but not Databricks) | A - Parquet, Avro, CSV, JSON, TSV, XML; Volumes | C - Sources = data lakes; external engines = 5; can be accessed via Unity APIs, HMS APIs, Iceberg APIs | Data lakes only; no Flink, no Hudi |
| Apache Polaris | D - Iceberg only | F - Iceberg tables only | B - Sources = data lakes; engines = 20+ | Nothing besides Iceberg |
| DataHub | A - Delta, Hudi, Iceberg | C - CSV, TSV, JSON, Parquet, Avro; no unstructured data | A - 75 | Limited file formats |
| AWS Glue | A - Delta, Hudi, Iceberg | C - CSV, Parquet, Avro, XML, JSON, ORC, Ion, grokLog; no unstructured data | C - Data lakes + 12 other sources | Focused on AWS services and sources |
| Apache Gravitino | C - Hudi, Iceberg, Paimon | A - Parquet, ORC, TXT, Avro, JSON; FileSets can be defined for unstructured objects | C - Sources = data lakes; engines = 5; Gravitino can also be an Iceberg catalog server | No Delta Lake |
| Atlan | C - Delta, Iceberg | D - Semi-structured limited to JSON; no unstructured data | A - 55, with extensibility via rich SDKs | Limited file formats: no Parquet, Avro, Proto |

Open Table Formats

Only Glue, DataHub, and Unity support all 3 major open table formats. Notably, Polaris is Iceberg-only, and Gravitino has Apache Paimon support but no Delta Lake support, likely because of the community’s roots in China.

File Types

Only Unity Catalog and Gravitino have dedicated features that focus on unstructured objects critical for AI scenarios. I was surprised by many basic file formats missing from various catalogs.

Connectors

Fitting more of the business catalog category, DataHub and Atlan vastly outnumber the others in terms of number of connectors. These catalogs not only cover data and databases, but they also cover other assets like jobs, dashboards, dbt models, etc.

Unity, Glue, and Polaris, fitting more into the metastore category, of course have a narrower focus on data lakes. Since Apache Iceberg focused so heavily on the catalog API specification up front, any engine compatible with Iceberg can now be used with Polaris. Unity Catalog OSS states in its main readme that it supports Unity APIs, HMS APIs, and Iceberg APIs, but I could only find documentation for how to do this with the Databricks product; otherwise, the OSS documentation calls out custom connectors for a limited set of engines. Most notably missing from the OSS documentation were Trino and Flink, although the Databricks product docs do mention Trino support. There is likely some room for improving clarity in this section, and I’m sure the community can help me update it.

Data Governance - Access Control

Data Catalog Policy Comparison

| Catalog | Control policy authoring (how can a user express who has access?) | Identity authentication + authorization (how are identity and credentials managed?) | Data lake access enforcement (is access to the root data enforced by the catalog?) | Query engine access enforcement (if root data access is enforced, which engines are compatible?) |
| --- | --- | --- | --- | --- |
| Unity Catalog OSS | C - Table and other asset-level access controls; column-level ACLs possible with shallow clone | A - Authentication with 3P IDPs (Google, Okta, etc.); authorization with IDP-provided tokens | A - Vends temp storage credential to query engine | D - Spark, Daft, DuckDB, PuppyGraph, SpiceAI |
| Unity Catalog (Databricks) | B - RBAC with privileges for tables and other assets | A - Authentication with 3P IDPs; SCIM for groups and identity federation | A - Vends temp storage credential to query engine | D - Spark, Fabric, DuckDB, Trino |
| Apache Polaris | A - RBAC with granular privileges for tables and other asset-level access controls | C - Client ID and secret use OAuth2 for access token | A - Vends temp storage credential to query engine | A - Any engine compatible with the Iceberg REST Catalog API |
| DataHub | F - Access control only on metadata, not the actual datasets | C - Authentication with 3P IDPs (Google, Okta, etc.) + SCIM groups; PAT for all programmatic access | F - No access enforcement on datasets | F - No access enforcement on datasets |
| AWS Glue | F - Access control only on metadata, not the actual datasets; can add Lake Formation for S3 data governance | A - Mature AWS IAM | F - No access enforcement on datasets | F - No access enforcement on datasets |
| Apache Gravitino | B - RBAC with privileges for tables and other asset-level access controls | D - Client ID and secret use OAuth2 for access token | C - Ranger authorization push-down | C - Spark, Flink, Trino; Gravitino can also be an Iceberg catalog server |
| Atlan | D - Access control only on metadata, not the actual datasets; Atlan offers its own query service which does govern access to data, but external engines cannot reference Atlan for auth | A - Authentication with 3P IDPs (Google, Okta, etc.); authorization with RBAC, roles, groups, policies | F - No access enforcement on datasets | F - No access enforcement on datasets |

This section is of particular interest to me, because I believe features in this category have the opportunity for the largest industry impact. The playbook that vendors have used for the last few decades is to lock a user’s data storage into their platform and keep them sticky to their compute services. The open table formats Hudi, Delta, and Iceberg have given rise to the lakehouse architecture, which challenges this playbook and makes data storage open and interoperable. From my personal observations, I believe vendors will now use access control with catalogs as their new lock-in point, aimed at keeping users sticky to their platform. This is one of the reasons why Tabular fetched such a large acquisition price from Databricks.

When data catalog communities discuss access control, the details sometimes get confusing. Generally, there is access control and governance of metadata (what shows up in the catalog), and then there is access control and governance of the data itself (the root datasets in storage). In this section I focus on access control and governance of the root datasets, in particular data stored in data lakes. In this comparison scope, the only catalogs that offer this are Unity, Polaris, and Gravitino.

Policy authoring

Granting access to data is not as simple as a yes/no “I give you access to this table.” What if I want to give you read-only permission? What if I want to allow you to write to the table, but not alter the schema? What if I want you to have all actions, but I don’t want you sharing your permissions with others? Role-based access control (RBAC) lets you assign a “role” to an “identity” that describes what granularity of access is allowed.

With Unity Catalog OSS you can create a user, add them to the catalog, and then grant them certain privileges on certain objects. Apache Polaris offers a slightly superior experience with more granular privilege options available. Gravitino is very similar to Polaris, since it also implements the Iceberg REST Catalog API. I expect Unity Catalog to follow suit here and start folding on the Unity APIs in favor of the Iceberg catalog APIs. Listen to the catalog panel discussion at timestamp 39:03 to hear Unity and Polaris leaders discuss this topic.
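To make this concrete, here is a hedged sketch of how such grants are typically expressed in SQL from a Spark session attached to the governing catalog. The syntax shown is Databricks Unity Catalog-style; the schema, table, and principal names are placeholders, and Unity Catalog OSS and Polaris expose equivalent grants through their own CLIs and REST APIs rather than necessarily through these exact statements:

```python
from pyspark.sql import SparkSession

# Assumes a Spark session already attached to a Unity-governed metastore.
spark = SparkSession.builder.getOrCreate()

# Read-only access for an analyst group (placeholder names).
spark.sql("GRANT USE SCHEMA ON SCHEMA main.sales TO `analysts`")
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `analysts`")

# A pipeline principal may write data to the table, but gets no broader privileges.
spark.sql("GRANT MODIFY ON TABLE main.sales.orders TO `etl_service`")
```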

Authentication and Authorization

Authentication is the process of verifying an identity; authorization is the process of verifying that this identity is allowed to access the requested asset. This is where Unity Catalog OSS currently has an advantage over Polaris. Unity integrates with 3P identity providers like Google Auth, Okta, etc., and authorization is vetted with the IDP-provided tokens. Polaris currently uses a client ID and secret with OAuth2.
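For the Polaris side, the exchange is a standard OAuth2 client-credentials flow. Below is a hedged requests sketch; the host is a placeholder, and the token path and PRINCIPAL_ROLE scope follow what Polaris’ getting-started examples use on top of the Iceberg REST spec:

```python
import requests

# Exchange a client ID/secret for a short-lived access token (placeholder host/ids).
resp = requests.post(
    "https://polaris.example.com/api/catalog/v1/oauth/tokens",
    data={
        "grant_type": "client_credentials",
        "client_id": "my-client-id",
        "client_secret": "my-client-secret",
        "scope": "PRINCIPAL_ROLE:ALL",
    },
)
resp.raise_for_status()
access_token = resp.json()["access_token"]  # sent as a Bearer token on catalog calls
```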

Data Lake Access 

Unity Catalog and Polaris both follow a similar design pattern for granting access to the root datasets in the data lake. Depending on your architecture, you can restrict all direct access to the root datasets from external systems and only give Unity or Polaris access. When a query engine wants access, it submits a request to the catalog for authorization, after which the catalog vends the query engine a temporary access credential for the appropriate storage locations.
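From the client’s point of view, credential vending happens on the Iceberg REST loadTable call. A hedged sketch (placeholder host, catalog, and table names; token from the previous sketch) looks roughly like this:

```python
import requests

access_token = "<token from the OAuth exchange above>"

# Load the table and ask the catalog to vend temporary storage credentials with it.
resp = requests.get(
    "https://catalog.example.com/api/catalog/v1/analytics/namespaces/sales/tables/orders",
    headers={
        "Authorization": f"Bearer {access_token}",
        "X-Iceberg-Access-Delegation": "vended-credentials",
    },
)
resp.raise_for_status()
load_result = resp.json()
# The "config" block carries scoped, expiring storage credentials the engine uses
# to read the table's files directly from object storage.
storage_config = load_result.get("config", {})
```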

Query Engine Compatibility

To facilitate the tight coordination of credential vending between the catalog and the query engine, don’t expect all engines to work out of the box. Unity Catalog currently lists only 5 compatible query engines in its documentation, which is very limiting (Spark, Daft, DuckDB, PuppyGraph, SpiceAI). Since Polaris extends the Iceberg REST Catalog APIs, it becomes plug-and-play for any query engine that is already Iceberg-compatible.

Data Governance - Compliance

Data Governance Table

| Catalog | Classification (can I annotate and classify certain tables, columns?) | Retention policies (can data retention policies be set and enforced?) | Auditing (can I audit all actions taken?) |
| --- | --- | --- | --- |
| Unity Catalog OSS | D - Key/value properties for each asset | F - No metadata or data lifecycle governance features | F - No audit logs |
| Unity Catalog (Databricks) | B - Tag-based classification and auto-detection of sensitive data | F - No metadata or data lifecycle governance features | A - Audit logs |
| Apache Polaris | F - No tagging or classification features | F - No metadata or data lifecycle governance features | F - No audit logs |
| DataHub | A - Auto classifier to detect sensitive data types; also rich tagging features | C - Metadata retention, but no data lifecycle features | A - Audit logs in managed service |
| AWS Glue | F - No classification; cannot tag tables or columns | F - No metadata or data lifecycle governance features | A - Audit logs in CloudTrail |
| Apache Gravitino | F - No tagging or classification features | F - No metadata or data lifecycle governance features | F - No audit logs |
| Atlan | C - Expressive tags for custom classifications at the asset level | F - No metadata or data lifecycle governance features | A - Rich, friendly UI to review history of activities, plus raw audit logs |

Data classification

Data classification is the process of organizing data by type or sensitivity, most often to ensure sensitive data assets have appropriate security measures. Common classification categories include public, internal, confidential, and highly sensitive data, each requiring specific access controls and compliance protocols.

DataHub and Databricks Unity Catalog (not OSS) offer the best experience here, with auto-classification in DataHub and proactive alerts in Databricks that can detect and automatically add classifications to your data.

(Databricks example UI)
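On the DataHub side, classifications ultimately land on datasets as tags, which can also be applied programmatically. Here is a hedged sketch using DataHub’s Python emitter; the GMS endpoint, platform, and dataset name are placeholders, and it simply sets the tag aspect rather than merging with any existing tags:

```python
from datahub.emitter.mce_builder import make_dataset_urn, make_tag_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import GlobalTagsClass, TagAssociationClass

# Placeholder GMS endpoint and dataset.
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
dataset_urn = make_dataset_urn(platform="hive", name="sales.customers", env="PROD")

# Attach a "pii" classification tag to the dataset.
tags = GlobalTagsClass(tags=[TagAssociationClass(tag=make_tag_urn("pii"))])
emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=tags))
```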

Retention policies

Managing data retention policies is critical to meet compliance and regulatory requirements. Depending on the data’s sensitivity, some data needs to be retained for 3 years and then deleted, some needs to be deleted after 30 days, etc. I was hoping to find retention features in catalogs, but it seems like a gap that may be useful for one of these projects to add to their roadmap.

Auditing

When something breaks, or a policy is breached, or most commonly just for routine compliance, it is important to store a full history of all actions taken on data and metadata. Atlan offers the best experience here with a rich UI in addition to the raw audit logs.

Data Lineage and Documentation

Data Governance Table

| Catalog | Data lineage (how well can the source of data be tracked?) | Data ownership and accountability (is data ownership clear? how is accountability maintained?) | Metric definitions (can metrics be clearly defined, traced, and monitored?) |
| --- | --- | --- | --- |
| Unity Catalog OSS | F - No data lineage | F | A - Create, govern, and leverage "functions" |
| Unity Catalog (Databricks) | B - Rich visualizations and lineage, but only from Spark DataFrames or DBSQL | A - Ownership tracked and visible | A - User-defined functions |
| Apache Polaris | F - No data lineage | F | F - No metric definitions, but can create a view |
| DataHub | A - Auto lineage extraction, rich visualizations, and column-level lineage | A - Owners and ownership roles | C - No metric entity, but business glossary allows docs |
| AWS Glue | F - No data lineage; you can use DataZone | F | F |
| Apache Gravitino | F - No lineage | F | F |
| Atlan | A - Auto lineage extraction, rich visualizations, and column-level lineage | A - Assign owners to assets | C - No metric entity, but business glossary allows docs |

Data Lineage

Data lineage provides a clear map of where data originates, how it moves through systems, and how it’s transformed over time. This visibility helps organizations troubleshoot errors, maintain compliance, and ensure data quality by identifying bottlenecks or inaccuracies. I was surprised to find that most of the catalogs in my selection scope were missing lineage. DataHub and Atlan have beautiful solutions here.

Data ownership and accountability

To some, this attribute may seem trivial, but when it is simple and accessible, identifying data ownership and accountability can save hours of searching in larger organizations. In combination with data lineage, if owners are assigned, you can quickly navigate through a root cause analysis and rapidly reach out to the necessary teams to raise issues or collaborate on remediation of data issues.

Metric definitions

Sometimes different teams come up with different definitions for business metrics like active users, churn, or engagement. Confusion over metric definitions can lead to mistakes in how data pipelines are created or maintained. In addition to lineage of a table or column, having centralized knowledge of how a metric is calculated and its dependencies can ensure firm data contracts are maintained for the data that matters most.

Data Quality Management

Schema and Data Quality Comparison

| Catalog | Schema management (can schema be registered, monitored, or enforced?) | Data quality policies (can data quality expectations be set and monitored?) | Monitoring + alerting (how well can you monitor and be made aware when there is a problem?) | Data freshness (can you monitor how recently your data has been updated?) |
| --- | --- | --- | --- | --- |
| Unity Catalog OSS | C - Basic schema evolution | F | F | F |
| Unity Catalog (Databricks) | C - Basic schema evolution | F - Data constraints, expectations, etc. can be set in Databricks pipelines, not a feature of UC alone | F - Slack, webhooks, etc. for Lakehouse Monitoring, not a feature of UC | F |
| Apache Polaris | C - Basic schema evolution | F | F | F |
| DataHub | C - Schema history viewer | D - Only Snowflake DMFs; unique future direction to create a universal DQ assertion spec that can sync across 3P tools | C - Configurable notifications and basic Slack integration | D - Only Snowflake DMFs |
| AWS Glue | A - Full schema registry | A - Advanced serverless data quality tool based on OSS Deequ | A - Events can be sent to EventBridge | A - Data freshness rules that measure a date column |
| Apache Gravitino | C - Basic schema evolution | F | F | F |
| Atlan | D - Schema changes tracked for SQL sources only | F | C - Notifications/alerts for metadata job failures; can build workflows to notify data consumers if data changes | D - Only looks at metadata freshness (ctrl+F "freshness") |

Schema management

All of the catalogs in my comparison scope allow and track schema evolution on tables (except for the source limitations documented for Atlan). AWS Glue takes schema management to the next level by offering a full-fledged schema registry. Read more about why a schema registry is valuable.
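As a rough illustration of what a full-fledged schema registry enables, here is a hedged boto3 sketch that registers an Avro schema with a compatibility mode in the Glue Schema Registry; the registry, schema, and field names are placeholders:

```python
import json
import boto3

glue = boto3.client("glue", region_name="us-east-1")

order_schema = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
}

# Create a registry and register the schema with BACKWARD compatibility,
# so incompatible changes are rejected at registration time.
glue.create_registry(RegistryName="lake-schemas")
glue.create_schema(
    RegistryId={"RegistryName": "lake-schemas"},
    SchemaName="orders-value",
    DataFormat="AVRO",
    Compatibility="BACKWARD",
    SchemaDefinition=json.dumps(order_schema),
)

# Evolving the schema: adding an optional field with a default is backward compatible.
evolved = dict(order_schema)
evolved["fields"] = order_schema["fields"] + [
    {"name": "currency", "type": ["null", "string"], "default": None}
]
glue.register_schema_version(
    SchemaId={"RegistryName": "lake-schemas", "SchemaName": "orders-value"},
    SchemaDefinition=json.dumps(evolved),
)
```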

Data Quality Expectations

Have you read about the open source project Great Expectations? It is essentially expressive and extensible unit tests for your data: set expectations and monitor/enforce that they are met. Similarly, data catalogs often talk about managing data quality, but the only catalog I found with these features built in was AWS Glue, which has leveraged the OSS project Deequ.
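For a feel of how this looks in Glue, here is a hedged boto3 sketch that creates a Glue Data Quality ruleset (built on Deequ) against a catalog table; the database, table, and DQDL rules are placeholders for illustration only:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Expectations expressed in Glue's DQDL rule language (illustrative rules only).
ruleset = """
Rules = [
    IsComplete "order_id",
    IsUnique "order_id",
    ColumnValues "amount" > 0,
    RowCount > 1000
]
"""

glue.create_data_quality_ruleset(
    Name="orders-quality-checks",
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "sales", "TableName": "orders"},
)
```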

Monitoring/Alerting

If a catalog knows everything about your data, can it proactively reach you if something is awry? Aside from in-app notifications, can it reach you with alerts via different mediums? DataHub allows you to configure basic alerts to come through Slack. Atlan allows you to build custom workflows that can trigger alerts on certain conditions. Glue integrates with AWS EventBridge which has many ways to send alerts.

Data freshness

For data to drive decisions, it needs to be consistently on time. Some use cases have stringent requirements for data freshness to lag no more than 1 minute, while others may be satisfied with 24 hours. Regardless of the window, it is important to monitor and track this, and surprisingly, in my research scope, only AWS Glue offered this capability.

Open Source

Project Metrics Comparison

| Project | GitHub stars (an indication of how popular a project is) | Issues / PRs (how many issues or PRs developers contribute) | # Contributors (how many people contributed) | Companies contributing (what companies contribute to the project) |
| --- | --- | --- | --- | --- |
| Unity Catalog | 2.4k | 52 merged PRs; 16 closed issues | 19 authors | Mostly Databricks |
| Apache Polaris | 1.2k | 50 merged PRs; 7 closed issues | 14 authors | Mostly Snowflake |
| DataHub | 9.9k | 156 merged PRs; 16 closed issues | 45 authors | Mostly Acryl and LinkedIn |
| Hive Metastore (foundation of AWS Glue) | 5.6k | 23 merged PRs; 0 closed issues (also 0 open issues) | 20 authors | Lately Cloudera, Microsoft, and Hortonworks seem largest |
| AWS Glue | N/A | N/A | N/A | N/A |
| Apache Gravitino | 1.1k | 235 merged PRs; 183 closed issues | 38 authors | Mostly Datastrato and Xiaomi |
| Atlan | N/A | N/A | N/A | N/A |

(⚠️ Data changes rapidly! Table is last 30 days as of 11/20/24)

When to use what?

Did you find the catalog that is right for you yet? One thing I did not expect when doing this research was coming to the unfortunate realization that I might need more than one catalog to cover all the bases for a complete data platform solution.

Regardless of the final metascore rankings, here are my recommendations for which situations I would pick which catalog(s).

The first recommendation is easy… if I’m a Databricks customer, there is no question in my mind that I should use the Databricks Unity Catalog. As a Databricks customer, I don’t need to worry about missing functionality in the OSS version. Having the Unity APIs as open source is slightly comforting, but I believe the Databricks and OSS UC APIs will converge on the Iceberg catalog APIs, one day leaving the Unity APIs behind. Remember, convergence on the catalog APIs is very different from convergence of the Delta and Iceberg table formats themselves, which I believe will remain distinct.

In the catalog space, there is a large feature divide between “metastores” and “business catalogs”. The Databricks Unity Catalog product is the only one in my mind that, in addition to being a full-fledged metastore, comes close to fulfilling many of the use cases expected of a business catalog. The main gap I see is around ecosystem completeness and breadth of connectors. If you have a large Databricks gravity center in your architecture it will work, but if Databricks is only a side-play in your architecture, evaluate with care.

If you need a metastore and want something more fulfilling than the Hive Metastore, maybe consider Unity Catalog OSS. I would be tempted to use it if I had a strong need for the access control mechanisms it offers, but the tradeoff is the limited set of supported query engines, so plan ahead accordingly. If you need features that a business catalog has to offer, you might need to supplement UC with another catalog.

Apache Polaris (Incubating) is still less than a year old, so the current feature set and experience are limited, but they will likely improve rapidly over the coming years. Given Iceberg’s hard dependency on a data catalog, if you use Iceberg today, what catalog are you using? If you want to go down a self-managed, OSS-only route and your entire estate is Iceberg-only, then consider Polaris for its good access control features. If you need features that a business catalog has to offer, you might need to supplement Polaris with another catalog.

If you have a diverse data estate with a sprawl of different tools and are looking for more of a business catalog to help all roles in your organization collaborate and leverage your data, then I would recommend you consider DataHub or Atlan. If you want an open source focused route, DataHub is your sure ticket with a strong community. If you don’t want to self-host DIY, you can also rely on Acryl for a managed solution. If you have a large gravity center around data lakes and you need stronger governance, you may need to supplement this with another metastore.

Atlan’s product experiences are beautiful. If you have goals similar to those I described for DataHub above, and you want simplicity and easy productivity for your entire organization, Atlan is a strong choice. Atlan’s functionality seems strongest if data warehouses are the gravity center in your organization, but weaker if a data lake is your focus, where you may need a supplementary metastore. There are many other strong business catalogs that I did not cover in my analysis, so if you are leaning towards Atlan, I would recommend you also investigate Alation, Collibra, or Informatica in parallel for comparison.

If you are on AWS, using some of their managed analytics services, then Glue is a no-brainer. It is easy to use and seamlessly falls into the background when you are stitching together solutions between AWS services. The only question people may have here is whether they should use one of the other metastores listed here instead of Glue. In my opinion, no… unless you have a tough multi-cloud architecture. Most of the features that are missing from Glue are missing because another AWS solution fills those gaps. An example of a large gap for Glue is access control, which AWS Lake Formation takes care of even better than the other catalogs analyzed here. Even under a multi-cloud architecture, these metastore limitations are also covered by GCP BigLake and the Azure OneLake Catalog.

AWS Glue is essentially a managed Hive Metastore. Despite the negative sentiment HMS receives, as mentioned above you can still use the modern table formats (Hudi, Iceberg, and Delta) and be free from the limitations of Hive. The Hive Metastore is still, in my opinion, the most sure-shot, easy choice for a bare-bones metastore. If you need to get a data lake up and running and don’t need many frills, grab an HMS and supplement it with a business catalog later.

In Summary

In 2025, as the ecosystem of catalogs continues to grow and mature, one of the dangers we need to collectively avoid as a community is vendor lock-in. What if I need my metadata available in multiple catalogs, or I decide to switch one day? What if I author all of my access control policies in one catalog; will it be possible to reuse those policies in another? What is your 5-year prediction from today? Do you think one catalog will eventually rise as the only one we need, or do you think we might need something like an “XCatalog”, similar to how XTable can translate and sync between table formats?

As mentioned previously I don’t think there is one single “winner” or a single best catalog for every use case today. I hope my research here serves as a starting point for you to do your own research and make a choice tailored to your unique data architecture and goals. 

I am certain I have made human errors in my analysis here. If you spot something I got wrong, or you want me to add another section, or feature, or catalog into the comparison, drop me a line: kyle@onehouse.ai and I will try to add updates as I can.

At Onehouse we embrace all of these catalogs and offer a unique Multi-Catalog Sync utility that ensures all of your lakehouse tables are available to any catalog and query engine of your choice. If you want to talk about catalogs, query engines, table formats, or any other topic, then reach out to us at gtm@onehouse.ai
