The race to build the most comprehensive data catalog is accelerating as vendors compete to become the central hub for an organization’s data stack. With so many options on the market, it can be challenging to evaluate and choose the right catalog for your environment. At Onehouse, we don’t build or offer a catalog; instead, we offer a multi-catalog synchronization utility. Many of our users work with multiple catalogs, and others ask us for advice and recommendations on what they should use. I recently hosted a panel discussion with some of the industry’s top minds building open source data catalogs: Unity Catalog, Apache Polaris (Incubating), DataHub, and Apache Gravitino.
As I started my own research I found many great blogs and resources, but I quickly realized that there was no neutral, comprehensive source of information that explained in depth what a catalog is and compared features across offerings.
Before you get deep into the read, I want to call out a few things. First, the definition of a catalog is very broad, the use cases are diverse, and the needs of different organizations can be unique. The engineering communities I mostly engage with are focused on data engineering around data lakes, or the data lakehouse architecture. So rather than trying to create a comparison that evaluates catalogs from multiple user points of view, I will take the point of view of an organization building a data platform where a data lake is a key component as the central store for most of its data. Lastly, as any good comparison article should call out, do your own research… I reference my research throughout, but your goals, needs, and architecture are unique; there is no single “winner” for everyone.
Since this blog is lengthy, use these anchor links to skip to the section that matters to you:
In the most general sense, a data catalog is an organized inventory of data assets within an organization. It helps users discover, understand, and manage the data available for use. A data catalog typically includes metadata (information about the data) such as data sources, descriptions, owners, quality metrics, lineage, and access controls. It acts like a searchable directory for datasets, similar to how a library catalog organizes books.
If you have data, whether your company is large or small, even if you are the only data engineer in your company, you have a data discovery problem. Data catalogs play a key role in keeping track of and surfacing awareness of what data assets are available. Catalogs vary widely in how effectively they handle this use case. In the most rudimentary form (more like a metastore), a catalog may simply be a place where you store a table name, schema, and reference to where the data can be found. Others have advanced crawlers and AI features exposing advanced search or proactive recommendations of datasets. Examples of features that matter for data discovery and exploration:
While data is a catalyst for innovation, it is also a potential liability that needs to be secured and safely governed. Since a data catalog has a comprehensive inventory of data across your estate, many catalogs also offer robust ways to author and maintain access control policies. Good access control must strike a delicate balance: rigorous control, but also a smooth experience, without delays and friction, for those granted permission to access data. Examples of features that matter for data governance and access control:
In addition to standard access control, organizations are challenged to meet global compliance certifications such as SOC 2, PCI DSS, HITRUST, FedRAMP, etc., which require deeper knowledge about what data exists in which data stores. A key component is maintaining a data inventory that classifies data based on sensitivity and regulatory requirements, such as personal, confidential, or public categories. This classification helps organizations apply appropriate protections, such as retention policies, encryption for sensitive data, or stricter access controls for personally identifiable information (PII). Regular audits and assessments ensure the inventory and classifications remain accurate and aligned with laws like GDPR, CCPA, or HIPAA. Examples of features that matter for compliance include:
Do you remember the last time you were troubleshooting a data quality problem on your table? Maybe some unexpected data came in, but you don’t know from where? Data lineage provides a detailed map of how data flows through an organization, tracking its origin, transformations, and destinations. This visibility provides the context for identifying errors and the ownership of data for accountability. Some data catalogs even support documenting or defining metrics and aggregations of data. Comprehensive documentation complements lineage by recording the methods, tools, and standards used to process and analyze data, making it easier for teams to collaborate and troubleshoot. Examples of features that matter for lineage:
Data quality management ensures that data is accurate, reliable, and fit for purpose. Some key features of data quality systems include:
The breadth of features and use cases for data catalogs is wide and diverse. If there are other categories of features you want to see analyzed, drop me a message. I would love to add them: kyle@onehouse.ai
With so many catalogs on the market it is easy to get confused between different options. If you are new to this space, let me try to break down catalogs into a few high level categories:
When going through all the catalogs, I found it hard to keep the categorizations mutually exclusive. Most catalogs serve as either a metastore or a business catalog, so below I separate some examples along just those lines to give you some idea:
If you want to unpack all the layers between metastores, catalogs, catalogs of catalogs, semantic layers, etc., you should definitely tune in to this conversation, “Taming the Chaos: A Deep Dive into Table Formats and Catalogs,” where Shirshanka (Acryl) and Vinoth (Onehouse) do a fun live exercise drawing up these lines and talking about the future development of products in this space.
A prevalent misunderstanding I see across the community is the difference between a “data catalog” and a “metastore”. A metastore is a much narrower service that is primarily focused on the governance of metadata. In the world of data lakes or the data lakehouse, a metastore is highly recommended, if not required. Data lakes by nature are a random collection of files. Metastores serve as a way to define and organize essential characteristics of your data, including schema, table and database definitions, or even other constructs or statistics that make it faster to access data, like partitions. When query engines come to access or analyze data stored in a data lake, they usually communicate through a metastore. The following diagram shows a generalization:
It’s hard to say the word metastore without immediately thinking about the Hive Metastore (HMS). It’s also hard to hear the word Hive Metastore without immediately thinking about complaints and grumbles. The Hive Metastore was initially released in 2010 as part of the Apache Hive project developed by Facebook to facilitate data warehousing and SQL-like query capabilities on top of Hadoop. The following diagram shows a high level overview of the internals of HMS (note this is sourced from a good article in 2016 and many features have been added since):
While there are plenty of good reasons to complain about HMS, the fact of the matter is that the Hive Metastore still powers the majority of data lakes in production today. Your Starbucks coffee is made in some part thanks to HMS, your trades on Robinhood are facilitated with HMS, your Amazon packages are delivered because HMS has your back, you can work from home more comfortably with Notion thanks to HMS, and the list goes on and on. It is also important to note the difference between the Hive Metastore, the Hive query engine, and the Hive table format. Many people throw around the word Hive with broad generalizations. AWS Glue is a Hive Metastore, but that does not mean you are using Hive tables or the Hive query engine, or even Hive-style partitioning. When you created a Databricks workspace (before Unity Catalog in 2022), by default it deployed with a Databricks-governed Hive Metastore. The majority of Iceberg deployments today use the Hive Metastore in combination with table metadata stored on cloud storage, and they can still leverage features like Iceberg partition evolution. I am just making this distinction about HMS because while the Hive table format and the Hive query engine are very much dinosaurs with little remaining usage, the Hive Metastore is unfortunately still almost unavoidable today.
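To make the metastore’s role concrete, here is a minimal PySpark sketch of how a query engine is typically pointed at a Hive Metastore so table and schema lookups resolve through it; the thrift URI, warehouse path, and table names are hypothetical placeholders, not a prescription for any particular deployment.

```python
from pyspark.sql import SparkSession

# Minimal sketch: point a Spark session at an existing Hive Metastore.
# The thrift URI, warehouse path, and table names are hypothetical.
spark = (
    SparkSession.builder
    .appName("hms-example")
    .config("hive.metastore.uris", "thrift://metastore.internal:9083")
    .config("spark.sql.warehouse.dir", "s3://my-bucket/warehouse/")
    .enableHiveSupport()
    .getOrCreate()
)

# Databases, table definitions, schemas, and partitions are resolved
# through the metastore, not by scanning raw files.
spark.sql("SHOW DATABASES").show()
spark.sql("SELECT * FROM sales.orders LIMIT 10").show()
```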
When discussing metastores and catalogs from the point of view of data lakes, a key point that cannot be ignored is the relationship between catalogs and open table formats (OTFs) like Apache Iceberg, Apache Hudi, and Delta Lake. First off, with the acronym “OTF” that I see commonly used recently, don’t confuse what a file “format” is (think Parquet, Avro, ORC, Proto, JSON) with what a table format is. A more accurate characterization is perhaps an open table “metadata” format: Iceberg, Hudi, and Delta are metadata abstractions on top of mostly Parquet files. For an in-depth introduction to what these projects are and how they are similar and different, please reference these materials outside of this blog:
Delta Lake and Apache Hudi historically have not had a hard dependency on a metastore or catalog. Delta Lake is soon introducing its first requirement for a catalog in its implementation of multi-table transaction support; watch this presentation from Databricks about the new “Coordinated Commits” feature available in preview in Delta Lake 4.0. Hudi is taking a different approach to this design and will likely still avoid a hard dependency on a catalog. Apache Iceberg, however, chose early on to take a strong dependency on a data catalog. There are pros and cons to this approach; it is not a wrong or bad design, just a community preference. This fact makes it important, however, to double-click on Iceberg and catalogs to understand the current state of things like the Iceberg REST catalog APIs and now the Apache Polaris (Incubating) catalog.
The first question to understand is: what does Iceberg rely on the catalog for? Alex Merced has a great intro here that calls out the need to maintain a list of the existing Iceberg tables and keep a reference to the "current" metadata.json file. Lisa Cao also has an excellent read here that touches on one of the more important points related to ACID compliance and transaction coordination. For the nerdy readers who want to dive in, I recommend Jack Vanlightly’s in-depth analysis of Iceberg's consistency model. This is one of the reasons why Snowflake faces tough tradeoffs when using Iceberg between internal and external catalogs.
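To illustrate why this matters, here is a toy, in-memory sketch (not actual Iceberg code) of the catalog’s core job in this model: holding the pointer to a table’s current metadata.json and swapping it atomically, so concurrent writers cannot silently overwrite each other’s commits.

```python
import threading

class InMemoryCatalog:
    """Toy illustration of the catalog's role in Iceberg-style commits:
    it owns the pointer to each table's current metadata.json and swaps it
    atomically, so two concurrent writers cannot both "win" a commit."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current = {}  # table identifier -> current metadata.json location

    def load_table(self, table):
        return self._current.get(table)

    def commit(self, table, expected_metadata, new_metadata):
        # Atomic compare-and-swap: succeed only if nobody committed in between.
        with self._lock:
            if self._current.get(table) != expected_metadata:
                raise RuntimeError("Concurrent commit detected; writer must retry")
            self._current[table] = new_metadata

catalog = InMemoryCatalog()
catalog.commit("db.orders", None,
               "s3://bucket/db/orders/metadata/v1.metadata.json")
catalog.commit("db.orders",
               "s3://bucket/db/orders/metadata/v1.metadata.json",
               "s3://bucket/db/orders/metadata/v2.metadata.json")
```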
With the strong dependency on a catalog for transactional safety, Iceberg has good integrations with several catalog implementations, including Hive Metastore, JDBC, and Nessie. With the release of Apache Iceberg version 0.14.0 in July 2022, the Iceberg project introduced the Iceberg REST Catalog API. Following the principle of the Hive Thrift service, the REST catalog API introduced a really great abstraction that allows you to use a single client to talk to any catalog implementation backend. This flexibility makes it easier for vendors across a diverse ecosystem of catalogs and query engines to build compatibility with Iceberg. The important point to understand here is that the Iceberg REST API is the “specification” of how to communicate with a catalog, but you still need a catalog “implementation” like Glue, Gravitino, Unity, Polaris, etc. After the Tabular acquisition and its upcoming product deprecation, there is now a mini race among vendors to fill the void and become the industry’s best Iceberg catalog implementation on top of the new REST API specification.
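As a concrete illustration of that specification/implementation split, here is a minimal PySpark sketch of connecting to an Iceberg REST catalog; the catalog name, endpoint, warehouse, and credential are hypothetical, and the Iceberg Spark runtime package is assumed to be on the classpath. In principle, swapping the URI is all it takes to point the same client at a different REST-compatible backend.

```python
from pyspark.sql import SparkSession

# Minimal sketch: one Spark client, any Iceberg REST catalog backend.
# Endpoint, warehouse, and credential values are hypothetical placeholders;
# the iceberg-spark-runtime package is assumed to be available.
spark = (
    SparkSession.builder
    .appName("iceberg-rest-example")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "rest")
    .config("spark.sql.catalog.lake.uri", "https://catalog.example.com/api/catalog")
    .config("spark.sql.catalog.lake.warehouse", "analytics")
    .config("spark.sql.catalog.lake.credential", "<client-id>:<client-secret>")
    .getOrCreate()
)

# Tables are resolved through whichever REST catalog implementation backs "lake".
spark.sql("SELECT * FROM lake.db.orders LIMIT 10").show()
```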
Are you ready to dive deep into a detailed comparison matrix between catalogs? Before we get there, I want to make it clear that I don’t think there is one single “winner” or best catalog for every scenario. My goal is for my research to give you a reference for how to make your own evaluation and to serve as a starting point for further reading. Please do your own research and make a choice tailored to your unique use case, data architecture, and goals. If you want to discuss these details live, I enjoy casual conversations about this topic, so reach out to me.
To evaluate which data catalog is right for you, I recommend you consider the following dimensions in your decision making:
With the decision criteria above in mind, let's compare the catalogs on the market today. There are probably more than 50 credible catalogs on the market and unfortunately I do not have infinite time to compare them all, so here are the ones I selected:
In this blog I divide up categories of features, describe what each feature should do, and then rate each catalog from A to F on how well it delivers that feature. Each feature is unique, and I roughly give the letters as follows: A = best solution, B = great solution, C = barely there, D = incomplete solution, F = missing functionality. After the one-by-one analysis I also created a metascore, where I awarded 4 points for every A, 3 for a B, 2 for a C, 1 for a D, and 0 for an F. The metascore is a sum of all points across all features and categories:
Is this a perfect, scientifically measured score? No… for example, perhaps not all features should be given equal weights. Is it possible to create an objectively perfect score? I don’t think so… these are not products where you can run some queries, count the seconds they took, and chalk up a TPC-DS score. As mentioned before, what features matter to you might not matter to another person. So take my analysis as a starting point to do your own research. A few of the ratings below could be interpreted as subjective; if you disagree with a rating, reach out to me and I would love to learn from your perspective and perhaps adapt my conclusions.
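For concreteness, here is a small sketch of how the metascore tally described above works; the grades in the example are hypothetical.

```python
# Sketch of the metascore tally: each letter grade maps to points,
# and a catalog's metascore is the sum across all rated features.
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

def metascore(grades):
    """grades: list of letter grades a catalog received across all features."""
    return sum(GRADE_POINTS[g] for g in grades)

# Hypothetical example: a catalog graded on five features.
print(metascore(["A", "B", "B", "D", "F"]))  # 4 + 3 + 3 + 1 + 0 = 11
```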
Best browsing experience for easy personalized discovery
Crawling
Some catalogs, like Unity and Polaris, require users to manually register tables and define their schemas. Others, like Glue, DataHub, and Atlan, offer crawlers that will automatically traverse your data and register it into your catalog.
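As a rough illustration of the crawler-based approach, here is a hedged boto3 sketch of pointing an AWS Glue crawler at an S3 prefix so it infers schemas and registers tables; the role ARN, bucket, and database names are hypothetical.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Sketch: point a Glue crawler at an S3 prefix so it infers schemas and
# registers tables into the catalog. Role ARN, bucket, and database names
# below are hypothetical placeholders.
glue.create_crawler(
    Name="sales-lake-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="sales",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/sales/"}]},
)
glue.start_crawler(Name="sales-lake-crawler")
```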
Data Details
Atlan and DataHub have undeniably beautiful experiences for learning intimate details about your datasets, with rich metadata, statistics, documentation, tags, business context, and more. Others simply show the table schema and basic metadata like when the table was created.
Search/Discovery
Atlan and DataHub again have no competition from the other catalogs in my comparison scope. I enjoyed Atlan’s search the most, as it has personalization features for the user.
Open Table Formats
Only Glue, DataHub, and Unity support all three major open table formats. Notably, Polaris is Iceberg-only, and Gravitino has Apache Paimon support but no Delta Lake support, likely because of the community’s roots in China.
File Types
Only Unity Catalog and Gravitino have dedicated features focused on the unstructured objects critical for AI scenarios. I was surprised by how many basic file formats were missing from various catalogs.
Connectors
Fitting more into the business catalog category, DataHub and Atlan vastly outnumber the others in terms of connectors. These catalogs cover not only data and databases but also other assets like jobs, dashboards, dbt models, etc.
Unity, Glue, and Polaris, fitting more into the metastore category, of course have a narrower focus on data lakes. Since Apache Iceberg focused so heavily on the catalog API specification up front, any engine compatible with Iceberg can now be used with Polaris. Unity Catalog OSS describes in its main README that it supports the Unity APIs, HMS APIs, and Iceberg APIs, but I can only find documentation for how to do this with the Databricks product. Otherwise the OSS documentation calls out custom connectors for a limited set of engines. Most notably missing from the OSS documentation were Trino and Flink, although the Databricks product docs do mention Trino support. There is likely some room for improving clarity in this section, and I’m sure the community can help me update it.
This section is of particular interest to me, because I believe features in this category have the opportunity for the largest industry impact. The playbook that vendors have used for the last few decades is to lock a user’s data storage into their platform and keep them sticky to their compute services. The open table formats Hudi, Delta, and Iceberg have given rise to the lakehouse architecture, which challenges this playbook and makes data storage open and interoperable. From my personal observations, I believe vendors will now use access control within catalogs as their new lock-in point, aimed at keeping users sticky to their platforms. This is one of the reasons why Tabular fetched such a large acquisition price from Databricks.
When data catalog communities discuss access control, the details sometimes get confusing. Generally, there is access control and governance of metadata (what shows up in the catalog), and then there is access control and governance of the data itself (the root datasets in storage). In this section I focus on access control and governance of the root datasets, in particular data stored in data lakes. In this comparison scope, the only catalogs that offer this are Unity, Polaris, and Gravitino.
Policy authoring
Granting access to data is not as simple as a yes/no decision on whether you can access a table. What if I want to give you read-only permission? What if I want to allow you to write to the table, but not alter the schema? What if I want you to have all actions, but I don’t want you sharing your permissions with others? Role-based access control (RBAC) allows you to assign a “role” to an “identity”, describing what granularity of access is allowed.
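To make the RBAC model concrete, here is a small, catalog-agnostic sketch (not any specific catalog’s API) of privileges bundled into roles and roles granted to identities; all names are hypothetical.

```python
# Conceptual RBAC sketch (not any specific catalog's API): privileges are
# bundled into roles, roles are granted to identities, and a request is
# authorized only if one of the caller's roles carries the needed privilege.
ROLES = {
    "orders_reader": {("sales.orders", "SELECT")},
    # A writer who may change data but not the table definition:
    # note there is no ("sales.orders", "ALTER") privilege here.
    "orders_writer": {("sales.orders", "SELECT"), ("sales.orders", "INSERT")},
}

GRANTS = {
    "analyst@example.com": {"orders_reader"},
    "pipeline-svc": {"orders_writer"},
}

def is_authorized(identity, table, action):
    roles = GRANTS.get(identity, set())
    return any((table, action) in ROLES[r] for r in roles)

print(is_authorized("analyst@example.com", "sales.orders", "SELECT"))  # True
print(is_authorized("pipeline-svc", "sales.orders", "ALTER"))          # False
```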
With Unity Catalog OSS you can create a user, add them to the catalog, and then grant them certain privileges on certain objects. Apache Polaris offers a slightly superior experience with more granular privilege options available. Gravitino is very similar to Polaris, since it also implements the Iceberg REST Catalog API. I expect Unity Catalog to follow suit here and start folding the Unity APIs in favor of the Iceberg catalog APIs. Listen to the catalog panel discussion at timestamp 39:03 to hear discussion on this topic from Unity and Polaris leaders.
Authentication and Authorization
Authentication is the process of verifying an identity; authorization is the process of verifying that this identity is allowed access to the requested asset. This is where Unity Catalog OSS currently has an advantage over Polaris. Unity integrates with third-party identity providers like Google Auth, Okta, etc., and authorization is vetted with the IdP-provided tokens. Polaris currently uses a client ID and secret with OAuth.
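For illustration, here is a hedged sketch of the OAuth2 client-credentials exchange a Polaris-style REST catalog expects before any catalog call; the endpoint path, scope, and credentials are assumptions for the example, not a definitive client.

```python
import requests

# Sketch of an OAuth2 client-credentials exchange against a Polaris-style
# catalog: trade a client id/secret for a bearer token, then attach it to
# subsequent catalog requests. Endpoint path, scope, and credentials below
# are illustrative assumptions.
resp = requests.post(
    "https://catalog.example.com/api/catalog/v1/oauth/tokens",
    data={
        "grant_type": "client_credentials",
        "client_id": "<client-id>",
        "client_secret": "<client-secret>",
        "scope": "PRINCIPAL_ROLE:ALL",
    },
)
resp.raise_for_status()
token = resp.json()["access_token"]

# Subsequent catalog requests carry the token as a bearer credential.
headers = {"Authorization": f"Bearer {token}"}
```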
Data Lake Access
Unity Catalog and Polaris both follow a similar design pattern for granting access to the root datasets in the data lake. Depending on your architecture, you can restrict all external access to the root datasets and give only Unity or Polaris access. When a query engine wants access, it submits a request to the catalog for authorization, after which the catalog vends the query engine a temporary access credential for the appropriate storage locations.
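Here is an illustrative sketch of that credential-vending handshake expressed over the Iceberg REST protocol; the URL, namespace, header usage, and response keys are assumptions for the example rather than a definitive client implementation.

```python
import requests

# Illustrative sketch of credential vending: the engine asks the catalog to
# load a table and to delegate storage access; the catalog authorizes the
# caller, then returns short-lived credentials scoped to that table's storage
# location alongside the table metadata. URL, namespace, and response keys
# are assumptions for illustration only.
resp = requests.get(
    "https://catalog.example.com/api/catalog/v1/analytics/namespaces/sales/tables/orders",
    headers={
        "Authorization": "Bearer <token>",
        "X-Iceberg-Access-Delegation": "vended-credentials",
    },
)
resp.raise_for_status()
table = resp.json()

storage_config = table.get("config", {})          # e.g., temporary storage keys
metadata_location = table.get("metadata-location")  # current metadata.json pointer
```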
Query Engine Compatibility
To facilitate the tight coordination of credential vending between the catalog and the query engine, don’t expect all engines to work out of the box. Unity Catalog currently lists only five compatible query engines in its documentation (Spark, Daft, DuckDB, PuppyGraph, SpiceAI), which is very limiting. Since Polaris extends the Iceberg REST catalog APIs, it is plug-and-play for any query engine that is already Iceberg compatible.
Data classification
Data classification is the process of organizing data by type or sensitivity, most often for the purpose of ensuring sensitive data assets have appropriate security measures. Common classification categories include public, internal, confidential, and highly sensitive data, each requiring specific access controls and compliance protocols.
DataHub and Databricks Unity Catalog (not OSS) offer the best experiences here, with auto-classification in DataHub and proactive alerts in Databricks that can detect and automatically add classifications to your data.
Retention policies
Managing data retention policies is critical for meeting compliance and regulatory requirements. Depending on the data sensitivity, some data needs to be retained for three years and then deleted, some needs to be deleted after 30 days, etc. I was hoping to find retention features in catalogs, but it seems to be a gap that would be useful for one of these projects to add to their roadmaps.
Auditing
When something breaks, or a policy is breached, or most commonly just for routine compliance, it is important to store a full history of all actions taken on data and metadata. Atlan offers the best experience here with a rich UI in addition to the raw audit logs.
Data Lineage
Data lineage provides a clear map of where data originates, how it moves through systems, and how it’s transformed over time. This visibility helps organizations troubleshoot errors, maintain compliance, and ensure data quality by identifying bottlenecks or inaccuracies. I was surprised to find that most of the catalogs in my selection scope were missing lineage. DataHub and Atlan have beautiful solutions here.
Data ownership and accountability
To some, this attribute may seem trivial, but when it is simple and accessible, identifying data ownership and accountability can save hours of searching in larger organizations. In combination with data lineage, if owners are assigned, you can quickly navigate through a root cause analysis and rapidly reach out to the necessary teams to raise issues or collaborate on remediation of data issues.
Metric definitions
Sometimes different teams come up with different definitions for business metrics like active users, churn, or engagement. Confusion around metric definitions can lead to mistakes in how data pipelines are created or maintained. In addition to lineage of a table or a column, having centralized knowledge of how a metric is calculated and its dependencies can ensure firm data contracts are maintained for the data that matters most.
Schema management
All of the catalogs in my comparison scope allowed and tracked schema evolution on tables (except for source limitations documented for Atlan). AWS Glue takes schema management to the next level by offering a full-fledged schema registry. Read more about why a schema registry is valuable.
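As a hedged example of what a schema registry workflow looks like, here is a boto3 sketch of registering an Avro schema in the AWS Glue Schema Registry under a backward-compatibility rule; the registry and schema names are hypothetical.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Sketch: register a schema in the Glue Schema Registry so producers and
# consumers can validate records and evolve schemas under a compatibility
# rule. Registry and schema names are hypothetical.
glue.create_schema(
    RegistryId={"RegistryName": "lakehouse-schemas"},
    SchemaName="orders-value",
    DataFormat="AVRO",
    Compatibility="BACKWARD",
    SchemaDefinition=(
        '{"type":"record","name":"Order","fields":['
        '{"name":"order_id","type":"string"},'
        '{"name":"amount","type":"double"}]}'
    ),
)
```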
Data Quality Expectations
Have you read about the open source project Great Expectations? It is essentially expressive and extensible unit tests for your data: set expectations and monitor/enforce that they are met. Data catalogs often talk about managing data quality in a similar way, but the only catalog I found with these features built in was AWS Glue, which has leveraged the OSS project Deequ.
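For a feel of what expectation-style data quality checks look like, here is a minimal sketch using the classic Great Expectations pandas API (exact calls vary by version); the file path and column names are hypothetical.

```python
import great_expectations as ge

# Sketch using the classic Great Expectations pandas API (exact calls vary
# by version): declare expectations about a dataset, then validate in bulk.
# File path and column names are hypothetical.
orders = ge.read_csv("orders.csv")

orders.expect_column_values_to_not_be_null("order_id")
orders.expect_column_values_to_be_between("amount", min_value=0)
orders.expect_column_values_to_be_in_set("status", ["pending", "shipped", "delivered"])

results = orders.validate()
print(results.success)
```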
Monitoring/Alerting
If a catalog knows everything about your data, can it proactively reach you if something is awry? Aside from in-app notifications, can it reach you with alerts via different mediums? DataHub allows you to configure basic alerts to come through Slack. Atlan allows you to build custom workflows that can trigger alerts on certain conditions. Glue integrates with AWS EventBridge which has many ways to send alerts.
Data freshness
For data to drive decisions, it needs to be consistently on time. Some use cases have stringent requirements for data freshness, with lag no longer than one minute, while others may be satisfied with 24 hours. Regardless of the window, it is important to monitor and track freshness, and surprisingly, within my research scope, only AWS Glue offered this capability.
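As a sketch of what a basic freshness check could look like using catalog metadata alone, here is a hedged boto3 example comparing a Glue table’s last update time against an SLA; the database, table, and SLA values are hypothetical, and production checks would more likely inspect partition or table-format commit times.

```python
from datetime import datetime, timedelta, timezone
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Sketch of a basic freshness check: compare the table's last update time in
# the catalog against an agreed SLA. Database, table, and SLA values are
# hypothetical placeholders.
table = glue.get_table(DatabaseName="sales", Name="orders")["Table"]
last_updated = table["UpdateTime"]  # timezone-aware datetime from boto3

sla = timedelta(hours=24)
lag = datetime.now(timezone.utc) - last_updated
if lag > sla:
    print(f"orders is stale: last updated {lag} ago (SLA {sla})")
```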
(⚠️ Data changes rapidly! The table reflects the last 30 days as of 11/20/24)
Did you find the catalog that is right for you yet? One thing I did not expect when doing this research was coming to the unfortunate realization that I might need more than one to cover all the bases for a complete data platform solution.
Regardless of the final metascore rankings, here are my recommendations for which situations I would pick which catalog(s).
The first recommendation is easy… If I’m a Databricks customer, there is no question in my mind that I should use the Databricks Unity Catalog. As a Databricks customer, I don’t need to worry about missing functionality in the OSS version. Having the Unity APIs as open source is slightly comforting, but I believe the Databricks and OSS UC APIs will converge on the Iceberg catalog APIs, one day leaving the Unity APIs behind. Remember, convergence on the catalog APIs is much different from convergence of the Delta and Iceberg table formats themselves, which I believe will remain distinct.
In the catalog space, there is a large feature divide between “metastores” and “business catalogs”. The Databricks Unity Catalog product is the only one in my mind that, in addition to being a full-fledged metastore, comes close to fulfilling many of the use cases needed from a business catalog. The main gap I see is around ecosystem completeness and breadth of connectors. If you have a large Databricks gravity center in your architecture it will work, but if Databricks is only a side-play in your architecture, evaluate with care.
If you need a metastore and want something more fulfilling than the Hive Metastore, maybe consider Unity Catalog OSS. I would be tempted to use it if I had a strong need for the access control mechanisms it offers, but the tradeoff is the limited set of supported query engines, so plan ahead accordingly. If you need the features that a business catalog has to offer, you might need to supplement UC with another catalog.
Apache Polaris (Incubating) is still less than a year old, so the current feature set and experience are very limited and will likely improve rapidly over the coming years. With Iceberg’s hard dependency on a data catalog, if you use Iceberg, what catalog are you using today? If you want to go down a self-managed, OSS-only route and your entire estate is Iceberg-only, then consider Polaris for its good access control features. If you need the features that a business catalog has to offer, you might need to supplement Polaris with another catalog.
If you have a diverse data estate with a sprawl of different tools and are looking for more of a business catalog to help all roles in your organization collaborate and leverage your data, then I recommend you consider DataHub or Atlan. If you want an open-source-focused route, DataHub is your sure ticket, with a strong community. If you don’t want to self-host DIY, you can also rely on Acryl for a managed solution. If you have a large gravity center around data lakes and you need stronger governance, you may need to supplement this with another metastore.
Atlan’s product experiences are beautiful. If you have goals similar to those I described for DataHub above, and you want simplicity and easy productivity for your entire organization, Atlan is a strong choice. Atlan’s functionality seems to be strongest if data warehouses are the gravity center of your organization, but weaker if a data lake is your focus, where you may need a supplementary metastore. There are many other strong business catalogs that I did not cover in my analysis, so if you are leaning towards Atlan, I recommend you also investigate Alation, Collibra, or Informatica in parallel for comparison.
If you are on AWS and using some of their managed analytics services, then Glue is a no-brainer. It is easy to use and seamlessly falls into the background when you are stitching together solutions between AWS services. The only question people may wonder about is whether they should use one of the other metastores listed here instead of Glue. In my opinion, no… unless you have a tough multi-cloud architecture. Most of the features missing from Glue are missing because another AWS solution fills those gaps. An example of a large gap for Glue is access control, which AWS Lake Formation takes care of even better than the other catalogs analyzed here. Even under a multi-cloud architecture, these metastore limitations are similarly covered by GCP BigLake and the Azure OneLake Catalog.
AWS Glue is essentially a managed Hive Metastore. Despite the negative sentiment HMS receives, as mentioned above, you can still use the modern table formats Hudi, Iceberg, and Delta with it and be free of the limitations of Hive tables. The Hive Metastore is still, in my opinion, the most sure-shot, easy choice for a bare-bones metastore. If you need to get a data lake up and running and don’t need many frills, grab an HMS and supplement it with a business catalog later.
In 2025, as the ecosystem of catalogs continues to grow and mature, one of the dangers we need to collectively avoid as a community is vendor lock-in. What if I need my metadata available in multiple catalogs, or I decide to switch one day? If I author all of my access control policies in one catalog, will it be possible to reuse those policies in another? What is your five-year prediction from today? Do you think one catalog will eventually rise as the only one we need, or do you think we might need something like an “XCatalog”, similar to how XTable can translate and sync between table formats?
As mentioned previously I don’t think there is one single “winner” or a single best catalog for every use case today. I hope my research here serves as a starting point for you to do your own research and make a choice tailored to your unique data architecture and goals.
I am certain I have made human errors in my analysis here. If you spot something I got wrong, or you want me to add another section, or feature, or catalog into the comparison, drop me a line: kyle@onehouse.ai and I will try to add updates as I can.
At Onehouse we embrace all of these catalogs and offer a unique Multi-Catalog Sync utility that ensures all of your lakehouse tables are available to any catalog and query engine of your choice. If you want to talk about catalogs, query engines, table formats, or any other topic, then reach out to us at gtm@onehouse.ai.