December 10, 2024

Comprehensive Data Catalog Comparison


Intro

The race to build the most comprehensive data catalog is accelerating as vendors compete to become the central hub for an organization’s data stack. With so many options on the market, it can be challenging to evaluate and choose the right catalog for your environment. At Onehouse, we don’t build or offer a catalog; instead, we feature a multi-catalog synchronization utility. Many of our users run multiple catalogs, and others ask us for advice and recommendations on what they should use. I recently hosted a panel discussion with some of the industry’s top minds building open source data catalogs: Unity Catalog, Apache Polaris (Incubating), DataHub, and Apache Gravitino.

(https://opensourcedatasummit.com/data-catalogs-panel/)

As I started to do my own research I found many great blogs and resources, but I quickly realized that there was no neutral or comprehensive source of information which explained in depth what a catalog is and compared features across offerings.

Before you get deep into the read I want to call out a few things. First, the definition of a catalog is very broad, the use cases are diverse, and the needs of different organizations can be unique. The engineering communities I mostly engage in are focused on data engineering revolving around data lakes or the data lakehouse architecture. So rather than trying to create a comparison that evaluates catalogs from multiple user points of view, I will take the point of view of an organization that is building a data platform where a data lake is a key component as the central store for most of its data. Lastly, as any good comparison article should call out, do your own research… I reference my research throughout, but your goals, needs, and architecture are unique, and there is no single “winner” for everyone.

Since this blog is lengthy, use these anchor links to skip to the section that matters to you:

What is a Catalog

Table Formats and Catalogs

How to Choose a Catalog

Feature Comparison Ratings

When to use Which Catalog

What is a data catalog?

Basic description

In the most general sense, a data catalog is an organized inventory of data assets within an organization. It helps users discover, understand, and manage the data available for use. A data catalog typically includes metadata (information about the data) such as data sources, descriptions, owners, quality metrics, lineage, and access controls. It acts like a searchable directory for datasets, similar to how a library catalog organizes books.

(source: Atlan)

Use cases

Data Discovery and Exploration

If you have data, no matter if your company is large or small, even if you are the only data engineer in your company, you have a data discovery problem. Data catalogs play a key role in keeping track of and surfacing awareness of what data assets are available. Catalogs have a broad spectrum of effectiveness on this use case. In the most rudimentary form (more like a metastore), a catalog may simply be a place that you store a table name, schema, and reference to where it can be found. Others will have advanced crawlers and AI features exposing advanced search or proactive recommendations of datasets. Examples of features that matter for data discovery and exploration:

  1. Crawling to find all assets - how can a catalog gather knowledge of all data?
  2. Browse/Discover - how easily can users browse and discover what they need?
  3. Search - how easily can users intentionally search for specific data?
  4. Business Glossary - what additional metadata is available including tags, documentation, statistics, etc?
  5. Classification of data - can I classify data with certain labels for compliance/security?

Data Governance - Access Control

While data is a catalyst to innovation, it is also a potential liability that needs to be secured and safely governed. Since a data catalog has a comprehensive inventory of data across your estate, many catalogs also offer robust ways to author and maintain access control policies. Good access control needs to strike a delicate balance: rigorous control on one side, and on the other a smooth experience, without delays and friction, for those who have been granted permission to access data. Examples of features that matter for data governance and access control:

  1. Policy authoring - how can a user express who has access to what data?
  2. Authentication and Authorization - how is a user identity validated, and how are credentials and secrets managed?
  3. Access enforcement with query engines - Is access to the root data in storage secured and governed by the catalog? When a query engine wants to access data can it go around the catalog and get it from storage anyways or does it have to leverage the catalog to be granted access at query time?

Data Governance - Compliance

In addition to standard access control, organizations are challenged to meet global compliance standards such as SOC 2, PCI DSS, HITRUST, FedRAMP, etc., which require deeper knowledge about what data exists in what data stores. A key component is maintaining a data inventory that classifies data based on sensitivity and regulatory requirements, such as personal, confidential, or public categories. This classification helps organizations apply appropriate protections, such as retention policies, encryption for sensitive data, or stricter access controls for personally identifiable information (PII). Regular audits and assessments ensure the inventory and classifications remain accurate and align with laws like GDPR, CCPA, or HIPAA. Examples of features that matter for compliance include:

  1. Data classification - can appropriate labels be placed on data to ensure they are safely handled according to internal policies?
  2. Retention policies - can automatic expiry, deletion, or prevention of the deletion of data be set to meet compliance standards?
  3. Auditing - is there logging of all activities kept for regulatory audits?

Data Lineage and Documentation

Do you remember the last time you were troubleshooting a data quality problem on your table? Maybe some unexpected data came in, but you don’t know where from? Data lineage provides a detailed map of how data flows through an organization—tracking its origin, transformations, and destinations. This visibility provides the context for identifying errors and ownership of data for accountability. Some data catalogs even support the documentation of or definition of metrics and aggregations of data. Comprehensive documentation complements lineage by recording the methods, tools, and standards used to process and analyze data, making it easier for teams to collaborate and troubleshoot. Examples of features that matter for lineage:

  1. Data Lineage - how well can you track the sources and dependencies of your data?
  2. Data ownership and accountability - is data ownership clear?
  3. Metric definitions - can metrics be defined and tracked?

Data Quality Management

Data quality management ensures that data is accurate, reliable, and fit for purpose. Some key features of data quality systems include:

  1. Schema Management - where engineers define and enforce structures for datasets, such as data types, constraints, and relationships, to maintain consistency and prevent errors during data integration or analysis. 
  2. Data Quality Policies - standards and rules for data accuracy, completeness, and validity, ensuring alignment with business and regulatory requirements. 
  3. Monitoring/Alerting - to maintain real-time reliability, engineers implement monitoring and alerting systems that continuously track metrics like error rates, anomalies, and missing values, notifying teams immediately when issues arise. 
  4. Data Freshness - ensuring that datasets are up-to-date and reflect the most current information.

The breadth of features and use cases for data catalogs is wide and diverse. If there are other categories of features you want to see analyzed, drop me a message. I would love to add them: kyle@onehouse.ai

Categories of catalogs

With so many catalogs on the market it is easy to get confused between different options. If you are new to this space, let me try to break down catalogs into a few high level categories:

Catalog Categories

| Catalog Category | What is it for? | Tradeoffs |
|---|---|---|
| Metastores | We will discuss metastores in much greater detail below, but the summary is that metastores focus on storing metadata about tables, schemas, partitions, and statistics, primarily for structured data in data lakes. Metastores are used as functional catalogs for query engines to reference more than they are used for the business to discover, learn, and govern data. In some cases, such as with Apache Hive and Apache Iceberg, metastores can also serve the purpose of transaction coordination. | Not great for data discovery, documentation, lineage, etc. Limited connections; usually just for data lake architectures. |
| Business Catalogs | Maybe there is a better name for this category, but business catalogs take a more user-centric approach, offering business-friendly metadata and contextual information about data assets, such as definitions, owners, and usage guidelines, to help non-technical stakeholders discover and understand data. These catalogs usually have a much larger network of connectors and their UI applications far outshine any others on the market. | While they have a more comprehensive view of all data, they typically only enforce access controls to metadata rather than the root data itself. |
| Catalog of Catalogs | Many organizations have multiple catalogs. Some catalogs can plug into multiple other catalogs to gather all of the metadata across them into a unified view. Catalogs in this category usually operate as either “aggregators”, which bring all data into one place for monitoring/control, or “delegators”, which use a unified view of all data to delegate operations or governance across each individual system. An example of an aggregator is DataHub, and an example of a delegator is Gravitino. | Being a layer above multiple catalogs, there will not be the same expressive feature depth of any of the individual catalogs. Depending on your team organization, this approach may be wonderful or challenging to harmonize. |

When going through all catalogs, I found it hard to keep mutually exclusive categorizations. Most catalogs of catalogs also serve as either a metastore or a business catalog, so below I separated some examples across just those two lines to give you some idea:

If you want to unpack all the layers between metastores, catalogs, catalogs of catalogs, semantic layers, etc., you should definitely tune in to the conversation “Taming the Chaos: A Deep Dive into Table Formats and Catalogs” with Shirshanka (Acryl) and Vinoth (Onehouse), who do a fun live exercise drawing up these lines and talking about the future development of products in this space.

Catalog vs Metastore

A prevalent misunderstanding I see across the community is the difference between a “data catalog” and a “metastore”. A metastore is a much more narrow service that is primarily focused on the governance of metadata. In the world of data lakes or the data lakehouse, a metastore is highly recommended, if not required. Data lakes by nature are a random collection of files. Metastores serve as a way to define and organize essential characteristics of your data including schema, table and database definitions, or even other constructs or statistics which make it faster to access data, like partitions. When query engines come to access or analyze data stored in a data lake, they usually communicate through a metastore. The following diagram shows a generalization:

It’s hard to say the word metastore without immediately thinking about the Hive Metastore (HMS). It’s also hard to hear the words Hive Metastore without immediately thinking about complaints and grumbles. The Hive Metastore was initially released in 2010 as part of the Apache Hive project developed by Facebook to facilitate data warehousing and SQL-like query capabilities on top of Hadoop. The following diagram shows a high-level overview of the internals of HMS (note that this is sourced from a good article from 2016 and many features have been added since):

While there are plenty of good reasons to complain about HMS, the fact of the matter is that the Hive Metastore still powers the majority of data lakes in production today. Your Starbucks coffee is made in some part thanks to HMS, your trades on Robinhood are facilitated with HMS, your Amazon packages are delivered because HMS has your back, you can work from home more comfortably with Notion thanks to HMS, and the list goes on and on. It is also important to note the difference between the Hive Metastore, the Hive query engine, and the Hive table format. Many people throw around the word Hive with broad generalizations. AWS Glue is a Hive Metastore, but that does not mean you are using Hive tables or the Hive query engine, or even Hive-style partitioning. When you created a Databricks workspace (before Unity Catalog in 2022), by default it deployed with a Databricks-governed Hive Metastore. The majority of Iceberg deployments today use the Hive Metastore in combination with table metadata stored on cloud storage, and they can still leverage features like Iceberg partition evolution. I am making this distinction about HMS because, while the Hive table format and the Hive query engine are very much dinosaurs with little remaining usage, the Hive Metastore is unfortunately still almost unavoidable today.

(unfortunate reality of the HMS?)

Relationship of Iceberg, Hudi, Delta and Catalogs

When discussing metastores and catalogs from the point of view of data lakes, a key point that cannot be ignored is the relationship between catalogs and open table formats (OTFs) like Apache Iceberg, Apache Hudi, and Delta Lake. First off, with the acronym “OTF” that I see used recently, don’t confuse a file “format” (think Parquet, Avro, ORC, Protobuf, JSON) with a table format. A more accurate characterization is perhaps an open table “metadata” format. Iceberg, Hudi, and Delta are metadata abstractions on top of mostly Parquet files. For an in-depth introduction to what these projects are and how they are similar and different, please reference these materials outside of this blog:

Delta Lake and Apache Hudi historically have not had a hard dependency on a metastore or catalog. Delta Lake is soon introducing its first requirement for a catalog in its implementation of multi-table transaction support. Watch this presentation from Databricks about the new “Coordinated Commits” feature available in preview on Delta Lake 4.0. Hudi is taking a different approach to this design and will likely still avoid a hard dependency on a catalog. Apache Iceberg, however, chose early on to take a strong dependency on a data catalog. There are pros and cons to this approach; it is not a wrong or bad design, just a community preference. I think this fact makes it important to double-click on Iceberg and catalogs to understand the current state of things like the Iceberg REST catalog APIs and now the Apache Polaris (Incubating) catalog.

The first question to understand is: what does Iceberg rely on the catalog for? Alex Merced has a great intro here that calls out the need to maintain a list of the existing Iceberg tables and keep a reference to the "current" metadata.json file. Lisa Cao also has an excellent read here that touches on one of the more important points, related to ACID compliance and transaction coordination. For the nerdy readers who want to dive in, I recommend Jack Vanlightly’s in-depth analysis of Iceberg's consistency model. This is one of the reasons why Snowflake has tough tradeoffs when using Iceberg between internal and external catalogs.
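To make the two catalog responsibilities above concrete, here is a minimal sketch in Python. It is a toy, in-memory stand-in and not Iceberg's actual implementation: `ToyCatalog`, `CommitConflict`, and the `s3://bucket/...` paths are all hypothetical names invented for illustration. It shows a catalog that keeps the list of tables plus a pointer to each table's current metadata.json, and that only accepts a commit when the writer's expected pointer still matches (an atomic compare-and-swap).

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer moved the table pointer first."""


class ToyCatalog:
    """In-memory stand-in for a catalog (hypothetical, for illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = {}  # table name -> location of the current metadata.json

    def register_table(self, name, metadata_location):
        with self._lock:
            if name in self._current:
                raise ValueError(f"table {name!r} already exists")
            self._current[name] = metadata_location

    def list_tables(self):
        with self._lock:
            return sorted(self._current)

    def current_metadata(self, name):
        with self._lock:
            return self._current[name]

    def commit(self, name, expected_location, new_location):
        # Atomic compare-and-swap: succeed only if nobody else committed
        # since this writer read the pointer.
        with self._lock:
            if self._current[name] != expected_location:
                raise CommitConflict(name)
            self._current[name] = new_location


catalog = ToyCatalog()
catalog.register_table("db.orders", "s3://bucket/orders/metadata/v1.metadata.json")

# Two writers read the same starting pointer.
base = catalog.current_metadata("db.orders")
catalog.commit("db.orders", base, "s3://bucket/orders/metadata/v2.metadata.json")

try:
    # The second writer still expects v1, so its commit is rejected.
    catalog.commit("db.orders", base, "s3://bucket/orders/metadata/v2b.metadata.json")
    second_commit_succeeded = True
except CommitConflict:
    second_commit_succeeded = False
```

In a real deployment the rejected writer would typically re-read the current metadata, re-apply its changes, and retry; the sketch stops at the rejection to keep the swap itself visible.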

With the strong dependency on a catalog for transactional safety, Iceberg has good integrations with several catalog implementations, including Hive Metastore, JDBC, and Nessie. With the release of Apache Iceberg version 0.14.0 in July 2022, the Iceberg project introduced the Iceberg REST Catalog API. Following the principle of the Hive Thrift service, the REST catalog API introduced a really great abstraction that allows you to use a single client to talk to any catalog implementation backend. This increased flexibility makes it easier for vendors from a diverse ecosystem of catalogs and query engines to build compatibility with Iceberg. The important point to understand here is that the Iceberg REST API is the “specification” of how to communicate with a catalog, but you still need a catalog “implementation” like Glue, Gravitino, Unity, Polaris, etc. After the Tabular acquisition and its upcoming product deprecation, there is now a mini race among vendors to fill the void and become the industry-best Iceberg catalog implementation on top of the new REST API specification.
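As a rough illustration of "one specification, many implementations", the sketch below builds a few request URLs in the shape the Iceberg REST catalog OpenAPI definition describes (`/v1/config`, `/v1/{prefix}/namespaces/...`). Treat the details as assumptions to verify against the spec version your catalog serves: the helper `rest_catalog_urls` and both base URIs are hypothetical, only single-level namespaces are handled, and no request is actually sent.

```python
from urllib.parse import quote


def rest_catalog_urls(base_uri, namespace, table, prefix=""):
    """Build a few Iceberg REST catalog URLs for one table.

    Only single-level namespaces are handled here; multi-level namespaces
    need extra encoding that this sketch leaves out.
    """
    root = base_uri.rstrip("/") + "/v1"
    # Some servers scope catalog routes under an optional prefix.
    scoped = f"{root}/{quote(prefix, safe='')}" if prefix else root
    ns = quote(namespace, safe="")
    return {
        "config": f"{root}/config",
        "list_namespaces": f"{scoped}/namespaces",
        "list_tables": f"{scoped}/namespaces/{ns}/tables",
        "load_table": f"{scoped}/namespaces/{ns}/tables/{quote(table, safe='')}",
    }


# Hypothetical base URIs for two different catalog backends.
backend_a = rest_catalog_urls(
    "https://catalog-a.example.com/api/catalog", "sales", "orders"
)
backend_b = rest_catalog_urls(
    "https://catalog-b.example.com/iceberg/", "sales", "orders", prefix="prod"
)
```

The client-side logic is identical for both backends; only the base URI (and optional prefix) changes, which is the flexibility the REST specification is meant to provide.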

Example architecture from Datastrato

How to choose a catalog?

Are you ready to dive deep into a detailed comparison matrix between catalogs? Before we get there, I want to make it clear that I don’t think there is one single “winner” or best catalog for every scenario. My goal is that my research here gives you a reference for both how to make your own evaluation and serves as a starting point of material for further reading. Please do your own research and make a choice tailored to your unique use case, data architecture, and goals. If you want to discuss these details live, I enjoy casual conversations about this topic, so reach out to me.

To evaluate which data catalog is right for you, I recommend you consider the following dimensions in your decision making:

  1. Do the features meet your use case and medium term goals?
    1. If your minimum needs are not met, it doesn’t matter how cool the product is or how cheap it is, don’t bother. 
    2. Where the interesting comparison comes is in the “nice to have” features. You have to evaluate the ROI of those features relative to other factors below of cost, complexity, etc.
  2. Community
    1. A vibrant community means that you will have support and content to leverage for help. When it is a community that shares similar goals or builds similar data architectures as yourself it means that your evolving needs are also more likely to be met.
    2. I separate “community” from “open source” below because it is possible to build a strong community even without an open source project.
  3. Open Source
    1. If a catalog is open source, you can contribute, you can fork and customize, and there is likely a community of other developers who you can network with and discuss issues with.
    2. When it comes to open source and catalogs I want to make sure there is a clear distinction between open source implementation and open source APIs as I believe this is a source of confusion. Let me use a few examples:
      1. The Apache Iceberg Catalog is a specification of open source REST APIs. The Apache Polaris Catalog is an open source server and implementation of open source APIs.
      2. The Snowflake Open Data Catalog is a closed source server and implementation of open source Iceberg APIs.
      3. Unity Catalog is a bit confusing here. Up until earlier this year, Unity Catalog was only a Databricks product. This summer they created the Unity Catalog open source project and donated it to the Linux Foundation. The Unity APIs were made open source, but the catalog implementation seems forked. Unity Catalog implementation in OSS is not equivalent to the implementation within the Databricks product and I put them side-by-side in my comparison for you to see the differences.
    3. Remember… not all open source is created equal. The key things to consider when evaluating the health of an open source project include:
      1. Diversity of contributions - is it “open source”, but only one company runs the show? Or are there engineers from multiple companies and countries contributing?
      2. Project governance - who “owns” the project, and what safeguards are in place to prevent the project from being taken over or changed without notice or against the community’s will? Remember the story from this year of Redpanda acquiring Benthos? The Apache Software Foundation is the gold standard for strong open source governance. I had first-hand experience earlier this year participating in the donation of Apache XTable (Incubating).
      3. Licensing - read materials outside this blog for more details, but pay attention to licensing as it may restrict you from using the software in commercial applications or restrict you from making modifications.
      4. Enterprise support - everyone loves open source because it is free, but sometimes you need a company with expertise who will back you up. Consider if this is something that matters to you.
  4. Ecosystem
    1. Data Catalogs by nature need to cover an extremely large number of data sources, data types, and connectors. Missing support for your data sources is usually a deal breaker.
  5. Cost
    1. Cost of data catalogs can vary wildly depending on the use cases you are implementing. Pay close attention to the pricing models of different vendors.
    2. Remember that open source does not equal free; you will still have infrastructure to deploy and maintain.
  6. Time / Complexity
    1. How much effort is involved to onboard, deploy, and more importantly maintain and operate the data catalog? Does it need central owners or distributed ownership? What level of automation exists to keep the catalog up to date, versus internal processes you need to develop?

Comparisons

With the decision criteria above in mind, let's compare the catalogs on the market today. There are probably more than 50 credible catalogs on the market and unfortunately I do not have infinite time to compare them all, so here are the ones I selected:

  1. Unity Catalog OSS - (v 0.2.0)
  2. Unity Catalog Databricks Product (DBR 16.0)
  3. Polaris - (v 0.9.0-rc1)
  4. DataHub (v 0.14.1)
  5. Glue (v4.0)
  6. Gravitino (v0.7.0)
  7. Atlan (v 4.7.49)

In this blog I divide the features into categories, describe what each feature should do, and then give each catalog an A-F grade for how well it delivers on that feature. Each feature is unique, and I roughly assign the letters as follows: A = best solution, B = great solution, C = barely there, D = incomplete solution, F = missing functionality. After the 1-by-1 analysis I also created a metascore where I awarded 4 points for every A, 3 for B, 2 for C, 1 for D, and 0 for an F. The metascore is the sum of all points across all features and categories:

Is this a perfect, scientifically measured score? No… for example, perhaps not all features should be given equal weights. Is it possible to create an objectively perfect score? I don’t think so… these are not products where you can run some queries, count the seconds, and chalk up a TPC-DS result. As mentioned before, the features that matter to you might not matter to another person, so take my analysis as a starting point for your own research. A few of the ratings below could be interpreted as subjective. If you disagree with a rating, reach out to me; I would love to learn from your perspective and perhaps adapt my conclusions.
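If you want to re-run the metascore with your own priorities, the arithmetic is simple enough to sketch in a few lines of Python. The point values are the ones described above, and the three sample rows are grades copied from the Data Discovery and Exploration table; the function name `metascore`, the feature keys, and the optional `weights` argument are my own illustrative additions, not part of the original scoring.

```python
POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}


def metascore(grades, weights=None):
    """Sum grade points, optionally weighting each feature (default weight 1)."""
    weights = weights or {}
    return sum(
        POINTS[grade] * weights.get(feature, 1)
        for feature, grade in grades.items()
    )


# Grades copied from the Data Discovery and Exploration table.
discovery = {
    "Unity Catalog OSS": {"crawling": "F", "data_details": "C", "search": "C"},
    "DataHub": {"crawling": "A", "data_details": "A", "search": "A"},
    "AWS Glue": {"crawling": "A", "data_details": "C", "search": "C"},
}

# Equal weights, as used in this blog.
scores = {name: metascore(grades) for name, grades in discovery.items()}

# Hypothetical weighting for a team that cares twice as much about crawling.
weighted = {
    name: metascore(grades, {"crawling": 2}) for name, grades in discovery.items()
}
```

With equal weights these three rows come out to 4, 12, and 8 points for this one category; doubling the weight on crawling widens the gap for catalogs with crawlers, which is exactly the kind of adjustment worth making for your own evaluation.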

(What I look like after doing this research…)

Features

Data Discovery and Exploration

Data Catalog Comparison

| Catalog | Crawling (How easy is it to register all my data in the catalog?) | Data details (How much do I learn about the data in the catalog or glossary?) | Search / Discovery (How easy is it to find what I’m looking for?) |
|---|---|---|---|
| Unity Catalog OSS | F: Manual table creation w/ schema definition | C: Schema and basic metadata | C: Basic catalog, schema, table tree with basic keyword search |
| Unity Catalog Databricks | C: Only tables created in Databricks automatically added, but not external data | A: Classifications, schema, stats, lineage, tags, docs, usage insights, etc | A: Navigational and intelligent search interprets schema, docs, stats |
| Apache Polaris | F: Manual table creation | C: Schema and basic metadata | C: Basic catalog, namespace, table tree with basic keyword search |
| DataHub | A: Push and pull methods to crawl and register data | A: Classifications, schema, statistics, lineage, tags, docs, DQ assertions, contracts, history, and more | A: Advanced search with complex filters. Good browsing experience for discovery |
| AWS Glue | A: Glue crawlers | C: Schema, partitions, indexes, and basic metadata | C: Basic catalog, db, table tree with basic keyword search |
| Apache Gravitino | C: Can load existing tables in a catalog you connect | C: Schema and basic metadata | C: Basic catalog, schema, table tree with basic keyword search |
| Atlan | A: Crawlers for sources | A: Schema, lineage, stats, classifications, owners, tags, docs, history | A: Advanced search with complex filters. Best browsing experience for easy personalized discovery |

Crawling

Some catalogs, like Unity and Polaris, require users to manually register tables and define the verbose schema. Others, like Glue, DataHub, and Atlan, offer crawlers that will automatically traverse your data and register it into your catalog.

Data Details

Atlan and DataHub have undeniably beautiful experiences for learning intimate details about your datasets, with rich metadata, statistics, documentation, tags, business context, and more. Others simply show the table schema and basic metadata like when the table was created.

Search/Discovery

Atlan and DataHub again have no competition from the other catalogs in my comparison scope. I enjoyed Atlan’s search the most, as it has personalization features for the user.

Data Connectors and Objects Covered

Data Catalog Comparison

| Catalog | Open Table Formats (which are supported?) | File types and unstructured data (which are supported?) | Connectors (breadth of ecosystem) | Notable connector gaps (anything key missing?) |
|---|---|---|---|---|
| Unity Catalog OSS | A: Delta, Hudi, Iceberg | A: Parquet, ORC, Avro, CSV, JSON, or TEXT; Volumes for unstructured and semi-structured | C: Sources = Data lakes; Engines = 5 | Data lakes only; No Flink or Trino |
| Unity Catalog Databricks | C: Delta, Iceberg; OSS Uniform and Unity can read Hudi, but not Databricks | A: Parquet, Avro, CSV, JSON, TSV, XML; Volumes | C: Sources = Data lakes; External engines = 5; Can access as Unity APIs, HMS APIs, Iceberg APIs | Data lakes only; No Flink, no Hudi |
| Apache Polaris | D: Iceberg only | F: Iceberg tables only | B: Sources = Data lakes; Engines = 20+ | Nothing besides Iceberg |
| DataHub | A: Delta, Hudi, Iceberg | C: CSV, TSV, JSON, Parquet, Avro; No unstructured data | A: 75 | Limited file formats |
| AWS Glue | A: Delta, Hudi, Iceberg | C: CSV, Parquet, Avro, XML, JSON, ORC, Ion, grokLog; No unstructured data | C: Data lakes; +12 other sources | Focused on AWS services and sources |
| Apache Gravitino | C: Hudi, Iceberg, Paimon | A: Parquet, ORC, TXT, Avro, JSON; FileSets can be defined for unstructured objects | C: Sources = Data lakes; Engines = 5; Gravitino can also be an Iceberg catalog server | No Delta Lake |
| Atlan | C: Delta, Iceberg | D: Semi-structured limited to JSON; No unstructured data | A: 55; w/ extensibility via rich SDKs | Limited file formats: No parquet, avro, proto |

Open Table Formats

Only Glue, DataHub, and Unity support all 3 major open table formats. Notably, Polaris is Iceberg only, and Gravitino has Apache Paimon support but no Delta Lake support, likely because of the community’s roots in China.

File Types

Only Unity Catalog and Gravitino have dedicated features that focus on unstructured objects critical for AI scenarios. I was surprised by many basic file formats missing from various catalogs.

Connectors

Fitting more of the business catalog category, DataHub and Atlan vastly outnumber the others in terms of number of connectors. These catalogs not only cover data and databases, but they also cover other assets like jobs, dashboards, dbt models, etc.

Unity, Glue, and Polaris, fitting more in the metastore category, of course have a narrower focus on data lakes. Since Apache Iceberg focused so heavily on the catalog API specification up front, any engine compatible with Iceberg can now be used with Polaris. Unity Catalog OSS states in its main README that it supports Unity APIs, HMS APIs, and Iceberg APIs, but I could only find documentation for how to do this with the Databricks product. Otherwise, the OSS documentation calls out custom connectors for a limited set of engines. Most notably missing in the OSS documentation were Trino and Flink, although the Databricks product docs do mention Trino support. There is likely some room for improving clarity in this section, and I’m sure the community can help me update it.

Data Governance - Access Control

Data Catalog Policy Comparison

| Catalog | Control Policy Authoring (how can user express who has access?) | Identity Authentication + Authorization (how is identity and credentials managed?) | Data lake access enforcement (is access to the root data enforced by the catalog?) | Query eng access enforcement (if access to root data enforced, which engines are compatible?) |
|---|---|---|---|---|
| Unity Catalog OSS | C: Table and other asset level access controls; Column lvl ACLs possible with shallow clone | A: Authentication with 3P IDPs: Google, Okta, etc; Authorization with IDP provided tokens | A: Vends temp storage credential to query engine | D: Spark, Daft, DuckDB, PuppyGraph, SpiceAI |
| Unity Catalog Databricks | B: RBAC with privileges for tables and other assets | A: Authentication with 3P IDPs; SCIM for groups and identity federation | A: Vends temp storage credential to query engine | D: Spark, Fabric, DuckDB, Trino |
| Apache Polaris | A: RBAC with granular privileges for tables and other asset level access controls | C: Client ID and secret use OAuth2 for access token | A: Vends temp storage credential to query engine | A: All compatible with Iceberg REST Catalog API |
| DataHub | F: Access control only on metadata, not the actual datasets | C: Authentication with 3P IDPs: Google, Okta, etc + SCIM groups; PAT for all programmatic access | F: No access enforcement on datasets | F: No access enforcement on datasets |
| AWS Glue | F: Access control only on metadata, not the actual datasets. Can add LakeFormation for s3 data governance | A: Mature AWS IAM | F: No access enforcement on datasets | F: No access enforcement on datasets |
| Apache Gravitino | B: RBAC with privileges for tables and other asset level access controls | D: Client ID and secret use OAuth2 for access token | C: Ranger Authorization push-down | C: Spark, Flink, Trino; Gravitino can also be an Iceberg catalog server |
| Atlan | D: Access control only on metadata, not the actual datasets. Atlan offers query service itself which does govern access to data, but external engines cannot reference Atlan for auth to access | A: Authentication with 3P IDPs: Google, Okta, etc; Authorization with RBAC, Roles, groups, policies | F: No access enforcement on datasets | F: No access enforcement on datasets |

This section is of particular interest to me, because I believe features in this category have the opportunity for the largest industry impact. The playbook that vendors have used for the last few decades is to lock a user’s data storage into their platform and keep them sticky to their compute services. The open table formats (Hudi, Delta, and Iceberg) have given rise to the lakehouse architecture, which challenges this playbook and makes data storage open and interoperable. From my personal observations, I believe vendors will now use access control with catalogs as their new lock-in point, aimed at keeping users sticky to their platform. This is one of the reasons why Tabular fetched such a large acquisition price from Databricks.

When data catalog communities discuss access control, the details sometimes get confusing. There are generally two things of interest: access control and governance of metadata (what shows up in the catalog), and access control and governance of the data itself (the root datasets in storage). In this section I focus on access control and governance of the root datasets, in particular data stored in data lakes. In this comparison scope, the only catalogs that offer this are Unity, Polaris, and Gravitino.

Policy authoring

Granting access to data is not as simple as a yes/no decision on whether you can access a table. What if I want to give you read-only permission? What if I want to allow you to write to the table, but not alter the schema? What if I want you to have all actions, but I don’t want you sharing your permissions with others? Role-based access control (RBAC) allows you to assign a “role” to an “identity”, and the role describes what granularity of access is allowed.
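To make the role and identity idea concrete, here is a minimal sketch of an RBAC check. The privilege names, the `Role` and `AccessPolicy` classes, and the table name are all hypothetical and chosen for illustration; each catalog defines its own privilege model.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical privilege names for illustration; real catalogs define their own.
READ, WRITE, ALTER_SCHEMA, GRANT = "READ", "WRITE", "ALTER_SCHEMA", "GRANT"


@dataclass
class Role:
    name: str
    # Maps a table name to the set of privileges this role holds on it.
    privileges: dict[str, set[str]] = field(default_factory=dict)


@dataclass
class AccessPolicy:
    # Maps an identity (user or service principal) to its assigned roles.
    assignments: dict[str, list[Role]] = field(default_factory=dict)

    def assign(self, identity: str, role: Role) -> None:
        self.assignments.setdefault(identity, []).append(role)

    def is_allowed(self, identity: str, table: str, privilege: str) -> bool:
        # Access is granted if any assigned role holds the privilege on the table.
        return any(
            privilege in role.privileges.get(table, set())
            for role in self.assignments.get(identity, [])
        )


analyst = Role("analyst", {"sales.orders": {READ}})
pipeline_writer = Role("pipeline_writer", {"sales.orders": {READ, WRITE}})

policy = AccessPolicy()
policy.assign("alice", analyst)
policy.assign("etl-service", pipeline_writer)

print(policy.is_allowed("alice", "sales.orders", WRITE))              # read-only role
print(policy.is_allowed("etl-service", "sales.orders", ALTER_SCHEMA)) # can write, not alter
```

Neither role holds `GRANT`, which is how this sketch models "you have access, but you cannot share your permissions with others".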

With Unity Catalog OSS you can create a user, add them to the catalog, and then grant them certain privileges on certain objects. Apache Polaris offers a slightly superior experience with more granular privilege options available. Gravitino is very similar to Polaris, since it also implements the Iceberg REST Catalog API. I expect Unity Catalog to follow suit here and start moving away from the Unity APIs in favor of the Iceberg Catalog APIs. Listen to the catalog panel discussion at timestamp 39:03 to hear discussion on this topic from Unity and Polaris leaders.

Authentication and Authorization

Authentication is the process of verifying an identity; authorization is the process of verifying that this identity is allowed access to the requested asset. This is where Unity Catalog OSS currently has an advantage over Polaris. Unity integrates with 3P identity providers like Google Auth, Okta, etc., and authorization is vetted with the IDP-provided tokens. Polaris currently uses a clientId and secret with OAuth.

Data Lake Access 

Unity Catalog and Polaris both follow a similar design pattern for granting access to the root datasets in the data lake. Depending on your architecture, you can restrict all access to the root datasets from external systems and only give Unity or Polaris access. When a query engine wants access, it submits a request to the catalog for authorization, after which the catalog vends the query engine a temporary access credential scoped to the appropriate storage locations.
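The credential vending pattern can be sketched in a few lines. This is a toy model under stated assumptions, not how Unity Catalog or Polaris is implemented: `ToyCatalog`, `VendedCredential`, the bucket path, and the 15 minute lifetime are all hypothetical. A real catalog would mint the token through the cloud provider’s temporary credential service rather than generate a random string.

```python
import secrets
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class VendedCredential:
    token: str           # opaque, short-lived secret handed to the engine
    storage_prefix: str  # the only storage location the token is scoped to
    expires_at: float    # Unix timestamp after which the token is invalid


class ToyCatalog:
    """Hypothetical catalog that owns storage access and vends scoped credentials."""

    def __init__(self, ttl_seconds=900):
        self.ttl_seconds = ttl_seconds
        self.locations = {}  # table name -> storage prefix
        self.grants = set()  # (principal, table) pairs allowed to read

    def register_table(self, table, storage_prefix):
        self.locations[table] = storage_prefix

    def grant_read(self, principal, table):
        self.grants.add((principal, table))

    def vend_credential(self, principal, table, now=None):
        # Authorization happens in the catalog, before any storage access.
        if (principal, table) not in self.grants:
            raise PermissionError(f"{principal} may not read {table}")
        now = time.time() if now is None else now
        return VendedCredential(
            token=secrets.token_urlsafe(16),
            storage_prefix=self.locations[table],
            expires_at=now + self.ttl_seconds,
        )


catalog = ToyCatalog(ttl_seconds=900)
catalog.register_table("sales.orders", "s3://example-lake/sales/orders/")
catalog.grant_read("spark-job", "sales.orders")

cred = catalog.vend_credential("spark-job", "sales.orders", now=1_000.0)
print(cred.storage_prefix, cred.expires_at)
```

The point of the design is that the engine never holds long-lived storage keys: it only ever sees a credential limited to one location and one short time window.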

Query Engine Compatibility

Credential vending requires tight coordination between the catalog and the query engine, so don’t expect all engines to work out of the box. Unity Catalog currently lists only 5 compatible query engines in its documentation (Spark, Daft, DuckDB, PuppyGraph, SpiceAI), which is very limiting. Since Polaris extends the Iceberg REST Catalog APIs, it becomes plug and play for any query engine that is already Iceberg compatible.
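As a rough illustration of what “plug and play” means for an Iceberg-compatible engine, the sketch below builds the Spark properties that point a session at an Iceberg REST catalog. The property names follow the Iceberg Spark and REST catalog configuration as I understand it, so verify them against the versions you run. The catalog name, URL, warehouse, and credential are placeholders, and `iceberg_rest_catalog_conf` is a hypothetical helper.

```python
def iceberg_rest_catalog_conf(name, uri, warehouse, credential):
    """Build Spark properties that register an Iceberg REST catalog under `name`."""
    prefix = f"spark.sql.catalog.{name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
        f"{prefix}.warehouse": warehouse,
        # OAuth2 client credentials in "client_id:client_secret" form.
        f"{prefix}.credential": credential,
        # Ask the catalog to vend temporary storage credentials to the engine.
        f"{prefix}.header.X-Iceberg-Access-Delegation": "vended-credentials",
    }


conf = iceberg_rest_catalog_conf(
    name="lake",
    uri="https://catalog.example.com/api/catalog",  # placeholder endpoint
    warehouse="analytics",                          # placeholder warehouse name
    credential="<client_id>:<client_secret>",       # placeholder secret
)

# Print the properties in the form accepted by spark-submit or spark-sql.
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

No engine-specific plugin appears in this configuration, which is the practical benefit of a catalog speaking the Iceberg REST protocol.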

Data Governance - Compliance

Data Governance Table
| Catalog | Classification (Can I annotate and classify certain tables and columns?) | Retention Policies (Can data retention policies be set and enforced?) | Auditing (Can I audit all actions taken?) |
| --- | --- | --- | --- |
| Unity Catalog OSS | D: Key/value properties for each asset | F: No metadata or data lifecycle governance features | F: No audit logs |
| Databricks Unity Catalog | B: Tag-based classification and auto-detection of sensitive data | F: No metadata or data lifecycle governance features | A: Audit logs |
| Apache Polaris | F: No tagging or classification features | F: No metadata or data lifecycle governance features | F: No audit logs |
| DataHub | A: Auto classifier to detect sensitive data types. Also rich tagging features | C: Metadata retention, but no data lifecycle features | A: Audit logs in managed service |
| AWS Glue | F: No classification, cannot tag tables or columns | F: No metadata or data lifecycle governance features | A: Audit logs in CloudTrail |
| Apache Gravitino | F: No tagging or classification features | F: No metadata or data lifecycle governance features | F: No audit logs |
| Atlan | C: Expressive tags for custom classifications at asset level | F: No metadata or data lifecycle governance features | A: Rich, friendly UI to review history of activities plus raw audit logs |

Data classification

Data classification is the process of organizing data by type or sensitivity, most often to ensure that sensitive data assets have appropriate security measures. Common classification categories include public, internal, confidential, and highly sensitive data, each requiring specific access controls and compliance protocols.

DataHub and Databricks Unity Catalog (not OSS) offer the best experience here, with auto classification in DataHub and proactive alerts in Databricks that can detect and automatically add classifications to your data.

(Databricks example UI)
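To give a feel for what automatic classification involves, here is a minimal sketch that samples column values and tags a column as sensitive when most values match a known pattern. It is not how DataHub or Databricks implement detection; the two regular expressions, the 80% threshold, and the tag names are assumptions chosen for illustration.

```python
import re

# Hypothetical detectors: label -> pattern that a sensitive value fully matches.
DETECTORS = {
    "EMAIL": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "US_SSN": re.compile(r"\d{3}-\d{2}-\d{4}"),
}


def classify_column(values, threshold=0.8):
    """Return labels whose pattern matches at least `threshold` of the non-null values."""
    sample = [str(v) for v in values if v is not None]
    if not sample:
        return set()
    labels = set()
    for label, pattern in DETECTORS.items():
        hits = sum(1 for v in sample if pattern.fullmatch(v))
        if hits / len(sample) >= threshold:
            labels.add(label)
    return labels


def classify_table(columns):
    """Map each column name to a sensitivity tag based on the detected labels."""
    return {
        name: "confidential" if classify_column(values) else "internal"
        for name, values in columns.items()
    }


customers = {
    "email": ["ana@example.com", "raj@example.org", None],
    "tax_id": ["123-45-6789", "987-65-4321"],
    "city": ["Lima", "Pune"],
}
tags = classify_table(customers)
print(tags)
```

In a real catalog the resulting tags would then drive access policies, for example masking every column tagged confidential.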

Retention policies

Managing data retention policies is critical to meeting compliance and regulatory requirements. Depending on the data sensitivity, some data needs to be retained for 3 years and then deleted, while other data needs to be deleted after 30 days. I was hoping to find retention features in these catalogs, but none of them offers data lifecycle enforcement today. This is a gap that one of these projects could usefully add to its roadmap.
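Since none of the catalogs enforce retention, teams usually script it themselves. Below is a minimal sketch of the core logic, assuming date-partitioned tables and a hypothetical mapping from a data category to its retention period. The category names and periods are illustrative only.

```python
from datetime import date, timedelta

# Hypothetical policy: data category -> how long partitions may be kept.
RETENTION = {
    "clickstream": timedelta(days=30),
    "financial": timedelta(days=3 * 365),
}


def expired_partitions(partitions, category, today):
    """Return the partition dates that are older than the category's retention period."""
    cutoff = today - RETENTION[category]
    return sorted(p for p in partitions if p < cutoff)


today = date(2024, 12, 10)
clickstream_partitions = [date(2024, 11, 1), date(2024, 11, 15), date(2024, 12, 1)]

to_delete = expired_partitions(clickstream_partitions, "clickstream", today)
print(to_delete)  # partitions older than 30 days as of `today`
```

A catalog-native version of this would attach the retention period to the table’s metadata and run the deletion on a schedule, which is exactly the feature I could not find.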

Auditing

When something breaks, or a policy is breached, or most commonly just for routine compliance, it is important to store a full history of all actions taken on data and metadata. Atlan offers the best experience here with a rich UI in addition to the raw audit logs.

Data Lineage and Documentation

Data Governance Table
| Catalog | Data Lineage (How well can the source of data be tracked?) | Data Ownership Accountability (Is data ownership clear? How is accountability maintained?) | Metric Definitions (Can metrics be clearly defined, traced, and monitored?) |
| --- | --- | --- | --- |
| Unity Catalog OSS | F: No data lineage | F | A: Create, govern and leverage "functions" |
| Databricks Unity Catalog | B: Rich visualizations, lineage, only from Spark DF or DBSQL | A: Ownership tracked and visible | A: User-defined functions |
| Apache Polaris | F: No data lineage | F | F: No metric definitions, but can create a view |
| DataHub | A: Auto lineage extraction, rich visualizations and column-level lineage | A: Owners and ownership roles | C: No metric entity, but business glossary allows docs |
| AWS Glue | F: No data lineage, you can use DataZone | F | F |
| Apache Gravitino | F: No lineage | F | F |
| Atlan | A: Auto lineage extraction, rich visualizations and column-level lineage | A: Assign owners to assets | C: No metric entity, but business glossary allows docs |

Data Lineage

Data lineage provides a clear map of where data originates, how it moves through systems, and how it’s transformed over time. This visibility helps organizations troubleshoot errors, maintain compliance, and ensure data quality by identifying bottlenecks or inaccuracies. I was surprised to find that most of the catalogs in my selection scope were missing lineage. DataHub and Atlan have beautiful solutions here.
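Under the hood, lineage is a directed graph of datasets, and "where did this come from?" is a graph traversal. Here is a minimal sketch with a hypothetical set of lineage edges. Real catalogs extract these edges automatically from query logs or pipeline metadata.

```python
from collections import deque

# Hypothetical lineage edges: dataset -> the datasets it is derived from.
UPSTREAM = {
    "dashboard.revenue": ["gold.daily_revenue"],
    "gold.daily_revenue": ["silver.orders", "silver.refunds"],
    "silver.orders": ["bronze.orders_raw"],
    "silver.refunds": ["bronze.refunds_raw"],
}


def upstream_of(dataset, edges=UPSTREAM):
    """Return every dataset that `dataset` depends on, nearest first (breadth-first)."""
    seen, order, queue = set(), [], deque([dataset])
    while queue:
        for parent in edges.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


sources = upstream_of("dashboard.revenue")
print(sources)
```

When a dashboard number looks wrong, walking this list from nearest to farthest is the troubleshooting path that a lineage UI visualizes for you.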

Data ownership and accountability

To some, this attribute may seem trivial, but when it is simple and accessible, identifying data ownership and accountability can save hours of searching in larger organizations. In combination with data lineage, if owners are assigned, you can quickly navigate through a root cause analysis and rapidly reach out to the necessary teams to raise issues or collaborate on remediation of data issues.

Metric definitions

Sometimes different teams come up with different definitions for business metrics like active users, churn, or engagement. Confusion over metric definitions can lead to mistakes in how data pipelines are created or maintained. In addition to lineage of a table or a column, having centralized knowledge of how a metric is calculated and what its dependencies are can ensure firm data contracts are maintained for the data that matters most.
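A centralized metric definition can be as simple as one registry that refuses conflicting definitions and records what each metric depends on. The sketch below is hypothetical (the `MetricDefinition` fields, the SQL expressions, and the table names are made up) and is not modeled on any specific catalog’s metric entity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefinition:
    name: str
    owner: str
    expression: str         # how the metric is calculated
    depends_on: tuple = ()  # tables or columns the calculation reads


class MetricRegistry:
    def __init__(self):
        self._metrics = {}

    def register(self, metric):
        # Reject a second, different definition of the same metric name.
        existing = self._metrics.get(metric.name)
        if existing is not None and existing != metric:
            raise ValueError(f"conflicting definition for metric {metric.name!r}")
        self._metrics[metric.name] = metric

    def metrics_depending_on(self, asset):
        """List the metrics that would be affected by a change to `asset`."""
        return sorted(m.name for m in self._metrics.values() if asset in m.depends_on)


registry = MetricRegistry()
registry.register(
    MetricDefinition(
        name="active_users",
        owner="growth-team",
        expression="COUNT(DISTINCT user_id) over the trailing 30 days",
        depends_on=("silver.events",),
    )
)
print(registry.metrics_depending_on("silver.events"))
```

If a second team tries to register `active_users` with a different expression, the registry raises an error, which forces the conversation about the definition to happen before two pipelines diverge.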

Data Quality Management

Schema and Data Quality Comparison
| Catalog | Schema Management (Can schema be registered, monitored, or enforced?) | Data Quality Policies (Can data quality expectations be set and monitored?) | Monitoring + Alerting (How well can you monitor and be made aware when there is a problem?) | Data Freshness (Can you monitor how recently your data has been updated?) |
| --- | --- | --- | --- | --- |
| Unity Catalog OSS | C: Basic schema evolution | F | F | F |
| Databricks Unity Catalog | C: Basic schema evolution | F: Data constraints, expectations, etc. can be set in Databricks pipelines, not a feature of UC alone | F: Slack, webhooks, etc. for Lakehouse Monitoring, not a feature of UC | F |
| Apache Polaris | C: Basic schema evolution | F | F | F |
| DataHub | C: Schema history viewer | D: Only Snowflake DMFs. Unique future direction to create a universal DQ assertion spec that can sync across 3P | C: Configurable notifications and basic Slack integration | D: Only Snowflake DMFs |
| AWS Glue | A: Full schema registry | A: Advanced serverless data quality tool based on OSS Deequ | A: Events can be sent to EventBridge | A: Data freshness rules that measure a date column |
| Apache Gravitino | C: Basic schema evolution | F | F | F |
| Atlan | D: Schema changes tracked for SQL sources only | F | C: Notifications/alerts for metadata job failures. Can build workflows to notify data consumers if data changes | D: Only looks at metadata freshness (Ctrl+F “freshness”) |

Schema management

All of the catalogs in my comparison scope allow and track schema evolution on tables (except for the source limitations documented for Atlan). AWS Glue takes schema management to the next level by offering a full-fledged schema registry. Read more about why a schema registry is valuable.

Data Quality Expectations

Have you read about the open source project Great Expectations? It is essentially expressive and extensible unit tests for your data. You set expectations, then monitor or enforce that they are met. Data catalogs often talk about managing data quality in a similar way, but the only catalog I found with these features built in was AWS Glue, which builds on the OSS project Deequ.
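The "unit tests for your data" idea can be sketched in plain Python. This deliberately does not use the Great Expectations or Deequ APIs. The two `expect_*` helpers and the sample rows are hypothetical and only illustrate the pattern of declaring an expectation and collecting a pass/fail result.

```python
def expect_not_null(rows, column):
    """Every row must have a non-null value in `column`."""
    failed = sum(1 for row in rows if row.get(column) is None)
    return {"check": f"{column} is not null", "success": failed == 0, "failed_rows": failed}


def expect_between(rows, column, low, high):
    """Every non-null value in `column` must fall within [low, high]."""
    failed = sum(
        1
        for row in rows
        if row.get(column) is not None and not (low <= row[column] <= high)
    )
    return {"check": f"{low} <= {column} <= {high}", "success": failed == 0, "failed_rows": failed}


orders = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": 2, "amount": -5.0},     # violates the range expectation
    {"order_id": None, "amount": 10.0},  # violates the not-null expectation
]

results = [
    expect_not_null(orders, "order_id"),
    expect_between(orders, "amount", 0, 10_000),
]
failures = [r["check"] for r in results if not r["success"]]
print(failures)
```

A pipeline can then decide what a failure means: block the write, quarantine the bad rows, or just raise an alert.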

Monitoring/Alerting

If a catalog knows everything about your data, can it proactively reach you if something is awry? Aside from in-app notifications, can it reach you with alerts via different mediums? DataHub allows you to configure basic alerts to come through Slack. Atlan allows you to build custom workflows that can trigger alerts on certain conditions. Glue integrates with AWS EventBridge which has many ways to send alerts.

Data freshness

For data to drive decisions, it needs to be consistently on time. Some use cases require data freshness to lag by no more than 1 minute, while others may be satisfied with 24 hours. Regardless of the window, it is important to monitor and track it. Surprisingly, within my research scope, only AWS Glue offered this capability.
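A freshness check boils down to comparing the last update time of a table against an allowed lag. Here is a minimal sketch, assuming you can read the last commit timestamp from the table’s metadata. The `check_freshness` helper and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def check_freshness(last_updated, max_lag, now):
    """Compare how far the data lags behind `now` against the allowed window."""
    lag = now - last_updated
    return {"lag": lag, "fresh": lag <= max_lag}


now = datetime(2024, 12, 10, 12, 0, 0, tzinfo=timezone.utc)
last_commit = datetime(2024, 12, 10, 11, 58, 30, tzinfo=timezone.utc)

# The same table can pass a relaxed window and fail a strict one.
strict = check_freshness(last_commit, timedelta(minutes=1), now)
relaxed = check_freshness(last_commit, timedelta(hours=24), now)
print(strict["fresh"], relaxed["fresh"])
```

Passing `now` in explicitly keeps the check deterministic, which makes it easy to test and to replay against historical commit times.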

Open Source

Project Metrics Comparison
| Project | GitHub Stars (An indication of how popular a project is) | Issues / PRs (How many developers contribute issues or PRs to the project) | # Contributors (How many people contributed) | Companies contributing (What companies contribute to the project) |
| --- | --- | --- | --- | --- |
| Unity Catalog | 2.4k | 52 merged PRs, 16 closed issues | 19 authors | Mostly Databricks |
| Apache Polaris | 1.2k | 50 merged PRs, 7 closed issues | 14 authors | Mostly Snowflake |
| DataHub | 9.9k | 156 merged PRs, 16 closed issues | 45 authors | Mostly Acryl and LinkedIn |
| Hive Metastore (foundation of AWS Glue) | 5.6k | 23 merged PRs, 0 closed issues (also 0 open issues) | 20 authors | Lately Cloudera, Microsoft, Hortonworks seem largest |
| AWS Glue | N/A | N/A | N/A | N/A |
| Apache Gravitino | 1.1k | 235 merged PRs, 183 closed issues | 38 authors | Mostly Datastrato and Xiaomi |
| Atlan | N/A | N/A | N/A | N/A |

(⚠️ Data changes rapidly! Table is last 30 days as of 11/20/24)

When to use what?

Did you find the catalog that is right for you yet? One thing I did not expect when doing this research was coming to the unfortunate realization that I might need more than one to cover all the bases for a complete data platform solution.

Regardless of the final metascore rankings, here are my recommendations for which situations I would pick which catalog(s).

The first recommendation is easy… If I’m a Databricks customer, there is no question in my mind that I should use the Databricks Unity Catalog. Being a Databricks customer I don’t need to worry about missing functionality in the OSS version. Having the Unity APIs as open source is slightly comforting, but I believe Databricks and OSS UC APIs will converge on the Iceberg catalog APIs, one day leaving the Unity APIs behind. Remember convergence on the catalog APIs is much different than convergence of Delta and Iceberg table formats themselves, which I believe will remain distinct.

In the catalog space, there is a large feature divide between “metastores” and “business catalogs”. The Databricks product Unity Catalog is the only one in my mind that, in addition to being a full fledged metastore, is close to pulling off and fulfilling many of the use cases needed from a business catalog. The main gap I see is around the ecosystem completeness and breadth of connectors. If you have a large Databricks product gravity center in your architecture it will work, but if Databricks is only a side-play in your architecture, evaluate with care.

If you need a metastore and want something more fulfilling than the Hive Metastore, maybe consider Unity Catalog OSS. I would be tempted to use this if I had a strong need to leverage the access control mechanisms it offers, but the tradeoff is the limited query engines that are supported so plan ahead accordingly. If you need features that a business catalog has to offer, you might need to supplement UC with another catalog.

Apache Polaris (Incubating) is still less than a year old, so the current feature set and experience are very limited and will likely improve rapidly over the coming years. Given Iceberg’s hard dependency on a data catalog, if you use Iceberg today, what catalog are you already using? If you want to go down a self-managed, OSS-only route and your entire estate is Iceberg only, then consider Polaris for its good access control features. If you need features that a business catalog has to offer, you might need to supplement Polaris with another catalog.

If you have a diverse data estate with a sprawl of different tools and are looking for more of a business catalog to help all roles in your organization collaborate and leverage your data, then I would recommend you consider DataHub or Atlan. If you want an open source focused route, DataHub is your sure ticket with a strong community. If you don’t want to self-host DIY, you can also rely on Acryl for a managed solution. If you have a large gravity center around data lakes and you need stronger governance, you may need to supplement this with another metastore.

Atlan product experiences are beautiful. If you have goals similar to those I described for DataHub above, and you want simplicity and easy productivity for your entire organization, Atlan is a strong choice. Atlan functionality seems to be strongest if data warehouses are the gravity center in your organization, and weaker if a data lake is your focus, in which case you may need a supplementary metastore. There are many other strong business catalogs that I did not cover in my analysis. So if you are here and leaning towards Atlan, I recommend you also investigate Alation, Collibra, or Informatica in parallel for comparison.

If you are on AWS and using some of their managed analytics services, then Glue is a no-brainer. It is easy to use and seamlessly falls into the background when you are stitching together solutions between AWS services. The only question is whether you should use one of the other metastores listed here instead of Glue. In my opinion, no… unless you have a tough multi-cloud architecture. Most of the features that are missing from Glue are missing because another AWS solution fills those gaps. An example of a large gap for Glue is access control, which AWS Lake Formation takes care of even better than the other catalogs analyzed here. Even under a multi-cloud architecture, these metastore limitations are also covered by GCP BigLake and Azure OneLake Catalog.

AWS Glue is essentially a managed Hive Metastore. Despite the negative sentiment it receives, as mentioned above, you can still use the modern table formats (Hudi, Iceberg, and Delta) and be free from the limitations of Hive. The Hive Metastore is still, in my opinion, the surest and easiest choice for a bare-bones metastore. If you need to get a data lake up and running and don’t need many frills, grab an HMS; you can supplement it with a business catalog later.

In Summary

In 2025, as the ecosystem of catalogs continues to grow and mature, one of the dangers we need to collectively avoid as a community is vendor lock-in. What if I need my metadata available in multiple catalogs, or I decide to switch one day? If I author all of my access control policies in one catalog, will it be possible to reuse these policies in another? What is your 5 year prediction from today? Do you think 1 catalog will eventually rise as the only one we need, or do you think we might need something like an “XCatalog”, similar to how XTable can translate and sync between table formats?

As mentioned previously I don’t think there is one single “winner” or a single best catalog for every use case today. I hope my research here serves as a starting point for you to do your own research and make a choice tailored to your unique data architecture and goals. 

I am certain I have made human errors in my analysis here. If you spot something I got wrong, or you want me to add another section, or feature, or catalog into the comparison, drop me a line: kyle@onehouse.ai and I will try to add updates as I can.

At Onehouse we embrace all of these catalogs and offer a unique Multi-Catalog Sync utility that ensures all of your lakehouse tables are available to any catalog and query engine of your choice. If you want to talk about catalogs, query engines, table formats, or any other topic, then reach out to us at gtm@onehouse.ai
