Authors: Shivam Banga
For data engineers and admins, making sure that all operations run smoothly is a priority.
That’s where Unity Catalog can help ensure that stored information is managed correctly, especially for those working with Azure Databricks. Unity Catalog (UC) is a metadata management and governance layer built into the Databricks Lakehouse Platform. It provides a centralized location for managing metadata about the data stored in the lakehouse, and it simplifies data management by providing a unified view of data across different data sources and formats.
Before Unity Catalog, every Azure Databricks (ADB) workspace had its own metastore, user management, and access controls, which led to duplicated effort when keeping all workspaces consistent. To overcome these challenges, Databricks developed Unity Catalog, a unified governance solution for data and AI assets on the Lakehouse. Unity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities across Azure Databricks workspaces.
At Tiger Analytics, we’ve helped clients enable Unity Catalog, both on new Databricks deployments and by upgrading existing Hive metastores, so that they can take advantage of everything Unity Catalog provides.
Making the Most of Unity Catalog’s Key Features
Centralized metadata and user management
Unity Catalog provides a centralized metadata layer that enables sharing data objects such as catalogs, schemas, and tables across multiple workspaces. It introduces two new built-in admin roles (Account Admin and Metastore Admin) to manage key features.
– Account Admin: manages account-level resources such as metastores, assigns metastores to workspaces, and assigns principals to workspaces.
– Metastore Admin: manages metastore objects and grants identities access to securable objects (catalogs, schemas, tables, and views).
Centralized data access controls
Unity Catalog uses standard SQL commands to grant access to data objects:
GRANT USE CATALOG ON CATALOG <catalog_name> TO <group_name>;
GRANT USE SCHEMA ON SCHEMA <catalog_name>.<schema_name> TO <group_name>;
GRANT SELECT ON TABLE <catalog_name>.<schema_name>.<table_name> TO <group_name>;
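As a rough illustration of how the grants fit together, the sketch below renders the statements a group would typically need to read a single table: one grant at each level of the hierarchy. The helper names, the sample catalog, schema, table, and group, and the choice of SELECT as the table privilege are all assumptions made for this example. The code only builds the statement text; the statements still have to be run in a Databricks SQL context.

```python
def quote(name: str) -> str:
    """Backtick-quote one identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def read_grants(catalog: str, schema: str, table: str, group: str) -> list[str]:
    """Return the statements that let `group` read one table.

    Access is granted at each level: catalog, schema, then the table itself.
    SELECT is an assumed example privilege, not the only possible one.
    """
    cat, sch, tbl, grp = (quote(x) for x in (catalog, schema, table, group))
    return [
        f"GRANT USE CATALOG ON CATALOG {cat} TO {grp};",
        f"GRANT USE SCHEMA ON SCHEMA {cat}.{sch} TO {grp};",
        f"GRANT SELECT ON TABLE {cat}.{sch}.{tbl} TO {grp};",
    ]


if __name__ == "__main__":
    # Hypothetical names, used only for illustration.
    for statement in read_grants("sales", "emea", "orders", "data-analysts"):
        print(statement)
```

Quoting the identifiers keeps names that contain hyphens, such as the sample group, valid in the generated SQL.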
Data lineage and data access auditing
Unity Catalog automatically captures user-level audit logs that record access to user data. It also captures lineage data that tracks how data assets are created and used across all languages and personas.
Data search and discovery
Unity Catalog lets you tag and document data assets and provides a search interface to help data consumers find data.
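Tagging and documenting can also be done in SQL. The sketch below renders a `COMMENT ON TABLE` statement and an `ALTER TABLE ... SET TAGS` statement for one table. The function names, the table name, and the tag keys are hypothetical, and the helper deliberately rejects text containing quotes or backslashes instead of trying to escape it.

```python
def literal(text: str) -> str:
    """Wrap text in single quotes; refuse input that would need escaping."""
    if "'" in text or "\\" in text:
        raise ValueError("quotes and backslashes are not supported in this sketch")
    return f"'{text}'"


def describe_table(table: str, comment: str, tags: dict[str, str]) -> list[str]:
    """Return SQL that attaches a comment and optional tags to a table."""
    statements = [f"COMMENT ON TABLE {table} IS {literal(comment)};"]
    if tags:
        pairs = ", ".join(f"{literal(k)} = {literal(v)}" for k, v in tags.items())
        statements.append(f"ALTER TABLE {table} SET TAGS ({pairs});")
    return statements


if __name__ == "__main__":
    # Hypothetical table and tags, used only for illustration.
    for statement in describe_table(
        "sales.emea.orders",
        "One row per customer order",
        {"domain": "sales", "pii": "no"},
    ):
        print(statement)
```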
Unity Catalog also allows Databricks users to share data securely outside the organization through Delta Sharing, and that sharing can be managed, governed, audited, and tracked.
Managing Users and Access Control
– Account admins can sync users and groups from the Azure Active Directory (Azure AD) tenant to the Azure Databricks account using a SCIM provisioning connector, and then assign them to workspaces.
– Azure Databricks recommends using account-level SCIM provisioning to create, update, and delete all users and groups in the account.
Unity Catalog Objects
A metastore is the top-level container of objects in Unity Catalog. It stores data assets (tables and views) and the permissions that govern access to them. A UC metastore is mapped to an ADLS container, which stores the metastore’s metadata and managed tables. You can create only one UC metastore per region, and each workspace can be attached to only one UC metastore at any point in time. Unity Catalog uses a three-level namespace (catalog.schema.table or catalog.schema.view) for referencing objects.
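To make the three-level reference concrete, here is a minimal sketch of how a partially qualified name can be expanded against a current catalog and schema. The `resolve` function and the sample names are hypothetical, and the sketch ignores backtick-quoted identifiers that contain dots.

```python
from typing import NamedTuple


class TableName(NamedTuple):
    """A fully qualified reference: catalog.schema.table (or view)."""

    catalog: str
    schema: str
    table: str

    def __str__(self) -> str:
        return f"{self.catalog}.{self.schema}.{self.table}"


def resolve(name: str, current_catalog: str, current_schema: str) -> TableName:
    """Expand a one-, two-, or three-part name to all three levels."""
    parts = name.split(".")
    if not all(parts) or len(parts) > 3:
        raise ValueError(f"not a valid table reference: {name!r}")
    if len(parts) == 3:
        return TableName(*parts)
    if len(parts) == 2:
        return TableName(current_catalog, *parts)
    return TableName(current_catalog, current_schema, parts[0])


if __name__ == "__main__":
    # Hypothetical names, used only for illustration.
    print(resolve("orders", "main", "default"))
    print(resolve("emea.orders", "main", "default"))
    print(resolve("hive_metastore.legacy.orders", "main", "default"))
```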
External Location and Storage Credential
– A storage credential, created from either a managed identity or a service principal, provides access to the underlying ADLS path.
– The identity behind the storage credential (managed identity or service principal) must be authorized on the external storage account by assigning it an IAM role at the storage account level.
– An external location is an object that combines a cloud storage path with a storage credential to authorize access to that path.
– Each cloud storage path can be associated with only one external location. If you attempt to create a second external location that references the same path, the command fails.
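The relationship between paths, credentials, and external locations can be sketched as a small in-memory model. It renders a `CREATE EXTERNAL LOCATION` statement and rejects a second location for a path that is already registered. The class, the location and credential names, and the storage account are hypothetical, and the model covers only the exact-path rule described above, not every check the real service performs.

```python
class ExternalLocations:
    """Toy registry enforcing one external location per storage path."""

    def __init__(self) -> None:
        self._name_by_path: dict[str, str] = {}

    def create(self, name: str, url: str, credential: str) -> str:
        """Register a location and return the SQL that would create it."""
        key = url.rstrip("/")  # treat '.../landing' and '.../landing/' alike
        if key in self._name_by_path:
            owner = self._name_by_path[key]
            raise ValueError(f"path already used by external location {owner!r}")
        self._name_by_path[key] = name
        return (
            f"CREATE EXTERNAL LOCATION {name} URL '{url}' "
            f"WITH (STORAGE CREDENTIAL {credential});"
        )


if __name__ == "__main__":
    # Hypothetical storage account, container, and object names.
    locations = ExternalLocations()
    url = "abfss://raw@examplestorage.dfs.core.windows.net/landing"
    print(locations.create("raw_zone", url, "adls_cred"))
    try:
        locations.create("raw_zone_copy", url, "adls_cred")
    except ValueError as err:
        print("rejected:", err)
```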
Managed and External Tables
– Unity Catalog manages the lifecycle of managed tables. This means that if you drop managed tables, both metadata and data are dropped.
– By default, the UC metastore’s ADLS container (the root storage location) also stores the data of managed tables, but you can override this default location at the catalog or schema level. Managed tables are in Delta format only.
– External tables are tables whose data is stored outside of the managed storage location specified for the metastore, catalog, or schema. Dropping them will only delete the metadata of the table.
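The difference between the two table types shows up in the `CREATE TABLE` statement and in what a drop removes. The sketch below assumes that supplying a `LOCATION` clause is what makes a table external. The function names, table names, and storage path are hypothetical.

```python
from typing import Optional


def create_table(name: str, columns: str, location: Optional[str] = None) -> str:
    """Render CREATE TABLE; a LOCATION clause makes the table external."""
    sql = f"CREATE TABLE {name} ({columns})"
    if location is not None:
        sql += f" LOCATION '{location}'"
    return sql + ";"


def removed_by_drop(is_external: bool) -> set[str]:
    """What DROP TABLE removes, per the managed/external rules above."""
    return {"metadata"} if is_external else {"metadata", "data"}


if __name__ == "__main__":
    # Hypothetical names and path, used only for illustration.
    print(create_table("sales.emea.orders", "id INT, amount DOUBLE"))
    print(
        create_table(
            "sales.emea.orders_ext",
            "id INT, amount DOUBLE",
            "abfss://raw@examplestorage.dfs.core.windows.net/landing/orders",
        )
    )
    print(sorted(removed_by_drop(is_external=False)))
    print(sorted(removed_by_drop(is_external=True)))
```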
How to Create a UC Metastore and Link Workspaces
This flow diagram explains the sub-tasks needed to create a Metastore.
Step 1: Create an ADLS storage account and container
This storage account container will store the Unity Catalog metastore’s metadata and managed tables.
Step 2: Create an access connector for Databricks
Create an Access Connector for Azure Databricks and, once the deployment is done, make a note of its Resource ID.
Step 3: Grant RBAC access to the access connector
Add a role assignment on the storage account: grant Storage Blob Data Contributor to the managed identity (the access connector) created in Step 2.
Step 4: Create metastore and assign workspaces
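Creating the metastore typically requires two values gathered in the earlier steps: the ADLS Gen2 path of the container from Step 1 and the resource ID of the access connector from Step 2. The sketch below shows the general shape of both strings. The helper names and every sample value (subscription ID, resource group, storage account, container, connector name) are placeholders for illustration.

```python
def adls_path(container: str, storage_account: str, directory: str = "") -> str:
    """Build an abfss:// URI for a container, with an optional directory."""
    base = f"abfss://{container}@{storage_account}.dfs.core.windows.net/"
    return base + directory.strip("/")


def access_connector_id(subscription_id: str, resource_group: str, name: str) -> str:
    """Build the Azure resource ID of an Access Connector for Azure Databricks."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Databricks/accessConnectors/{name}"
    )


if __name__ == "__main__":
    # Placeholder values; substitute the ones from your own deployment.
    print(adls_path("metastore", "examplestorage"))
    print(
        access_connector_id(
            "00000000-0000-0000-0000-000000000000", "rg-data", "uc-connector"
        )
    )
```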
Once a UC metastore has been attached to a workspace, it is visible under the workspace’s Data tab.
If Unity Catalog is enabled for an existing workspace that has tables stored under the hive_metastore catalog, those tables can be upgraded using the SYNC command or the UI, or they can still be accessed as hive_metastore.<schema_name>.<table_name>.
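As a sketch of the SYNC route, the helper below renders a `SYNC TABLE` statement that maps a Hive metastore table to the same schema and table name in a Unity Catalog catalog. It defaults to `DRY RUN` so the upgrade can be checked before it is applied. The function and the catalog, schema, and table names are hypothetical. Which table types SYNC can upgrade should be confirmed against the Databricks documentation.

```python
def sync_table(target_catalog: str, schema: str, table: str, dry_run: bool = True) -> str:
    """Render SYNC TABLE from hive_metastore into a Unity Catalog catalog."""
    statement = (
        f"SYNC TABLE {target_catalog}.{schema}.{table} "
        f"FROM hive_metastore.{schema}.{table}"
    )
    if dry_run:
        statement += " DRY RUN"  # report what would happen without changing anything
    return statement + ";"


if __name__ == "__main__":
    # Hypothetical names, used only for illustration.
    print(sync_table("main", "legacy", "orders"))
    print(sync_table("main", "legacy", "orders", dry_run=False))
```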
Enabling Unity Catalog as part of a lakehouse architecture provides a centralized metadata layer for enterprise-level governance without sacrificing the ability to manage and share data effectively. It also helps teams plan workspace deployments with platform limits in mind, which reduces the risk of ending up unable to share data or govern a project. With Unity Catalog, we can overcome the limitations and constraints of the existing Hive metastore, enabling us to better collaborate and leverage the power of data according to specific business needs.