Iceberg REST Catalog with Hive Metastore
This post shows one way to set up the Iceberg REST Catalog using the HiveCatalog implementation, backed by the Hive Metastore.
It covers the HOW in detail. A separate post will cover the WHY.
Components
Hive Metastore
Set up the Hive Metastore using Docker. Expose the Hive Metastore port (default 9083).
There are a number of tutorials online for running the Hive Metastore in a Docker container. For this demo, we’re going to use recap-build/hive-metastore-standalone since it was the easiest to set up.
Hive Metastore Client
Interact with the Hive Metastore to read its content and verify that it's running properly.
We’re going to use recap-build/pymetastore to query the Hive Metastore.
Iceberg REST Catalog
Set up the Iceberg REST Catalog server using Docker. Expose the REST port (default 8181).
Tabular provides a way to run the Iceberg REST Catalog in Docker with a configurable backing catalog, tabular-io/iceberg-rest-image. The repository includes examples that use the GlueCatalog and an in-memory JdbcCatalog. We’re going to configure it to run the HiveCatalog instead.
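To make "configurable backing catalog" more concrete, here is a rough sketch of how the image appears to turn environment variables into catalog properties. The convention used here (strip the `CATALOG_` prefix, turn `__` into `-`, turn `_` into `.`, lowercase the result) is my reading of the image and should be checked against its README. The Hive values are illustrative assumptions, not the exact configuration from the branch used later in this post: `hive-metastore` is a hypothetical hostname and `/tmp/warehouse` is a hypothetical warehouse path.

```python
def env_to_catalog_properties(env, prefix="CATALOG_"):
    """Translate CATALOG_* environment variables into catalog property names.

    Assumed convention: drop the prefix, "__" becomes "-", "_" becomes ".",
    and the result is lowercased.
    """
    properties = {}
    for key, value in env.items():
        if not key.startswith(prefix):
            continue  # unrelated environment variable
        name = key[len(prefix):].replace("__", "-").replace("_", ".").lower()
        properties[name] = value
    return properties


# Illustrative environment for a HiveCatalog backing catalog (values are assumptions).
hive_env = {
    "CATALOG_CATALOG__IMPL": "org.apache.iceberg.hive.HiveCatalog",
    "CATALOG_URI": "thrift://hive-metastore:9083",  # hypothetical metastore host
    "CATALOG_WAREHOUSE": "/tmp/warehouse",  # hypothetical warehouse location
    "PATH": "/usr/bin",  # ignored: no CATALOG_ prefix
}

catalog_properties = env_to_catalog_properties(hive_env)
```

Under that assumption, `catalog_properties` ends up with the keys `catalog-impl`, `uri`, and `warehouse`, which are the property names the backing catalog would be initialized with.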
Iceberg REST Catalog Client
Interact with the Iceberg REST Catalog to read its content. Verify that it’s returning the same result as the Hive Metastore Client.
We’re going to use pyiceberg (apache/iceberg/python) to query the Iceberg REST Catalog.
The Setup
Step 1: Run Hive Metastore
docker run -p 9083:9083 ghcr.io/criccomini/hive-metastore-standalone:latest
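The container can take a few seconds before the metastore accepts connections. As a small convenience, the sketch below polls the exposed port until a TCP connection succeeds. The function name `wait_for_port` and its defaults are my own choices, and a successful connection only shows that something is listening on the port, not that the metastore is fully healthy.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError if nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

For example, `wait_for_port("localhost", 9083)` can be called before running the client in the next step.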
Step 2: Run Hive Metastore Client
Requires the `pymetastore` Python library.
pip install pymetastore
from pymetastore.metastore import HMS

# Connect to the Hive Metastore exposed in Step 1.
with HMS.create(host="localhost", port=9083) as hms:
    databases = hms.list_databases()
    database = hms.get_database(name="default")
    tables = hms.list_tables(database_name=database.name)
    print(databases, database, tables)
Step 3: Run Iceberg REST Catalog
As of this writing, this step requires the branch from my repo, kevinjqliu:kevinjqliu/run-hive-catalog.
Clone the repo and spin up the Docker container using:
docker-compose up --build
Step 4: Run Iceberg REST Catalog Client
Requires the `pyiceberg` Python library.
pip install pyiceberg
pyiceberg --uri http://localhost:8181 list
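To check that the REST catalog returns the same result as the Hive Metastore client, the comparison can also be done in Python. This is a minimal sketch, assuming that `hms.list_databases()` returns database names as plain strings and that pyiceberg's `list_namespaces()` returns tuples such as `("default",)`. The helper names are my own, and the two client libraries are imported inside `compare_catalogs` so the pure helpers can be used on their own.

```python
def namespace_names(namespaces):
    """Flatten namespace tuples such as ("default",) into dotted names."""
    return {".".join(namespace) for namespace in namespaces}


def same_databases(hms_databases, rest_namespaces):
    """True when both clients report the same set of database names."""
    return set(hms_databases) == namespace_names(rest_namespaces)


def compare_catalogs(hms_host="localhost", hms_port=9083, rest_uri="http://localhost:8181"):
    """Query both services and compare what they list (requires both to be running)."""
    # Imported here so the helpers above work without either client installed.
    from pyiceberg.catalog import load_catalog
    from pymetastore.metastore import HMS

    with HMS.create(host=hms_host, port=hms_port) as hms:
        hms_databases = hms.list_databases()

    catalog = load_catalog("rest", type="rest", uri=rest_uri)
    return same_databases(hms_databases, catalog.list_namespaces())
```

With both containers running, `compare_catalogs()` should return `True` if the two views agree.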
Iceberg REST Endpoints
Testing out different endpoints
pyiceberg --uri http://localhost:8181 list
pyiceberg --uri http://localhost:8181 list default
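The CLI commands above go through the server's HTTP API, so the same data can be fetched directly. The sketch below uses only the Python standard library and the `/v1/config`, `/v1/namespaces`, and `/v1/namespaces/{namespace}/tables` routes from the Iceberg REST specification. It assumes the server runs on `localhost:8181` without a path prefix; the helper names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://localhost:8181"  # REST port exposed by the container


def endpoint(base_url, *segments):
    """Build a /v1/... URL, percent-encoding each path segment."""
    path = "/".join(quote(segment, safe="") for segment in segments)
    return f"{base_url.rstrip('/')}/v1/{path}"


def get_json(url, timeout=10.0):
    """GET a URL and decode the JSON response body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def show_catalog(base_url=BASE_URL, namespace="default"):
    """Print the config, the namespaces, and the tables of one namespace."""
    urls = (
        endpoint(base_url, "config"),
        endpoint(base_url, "namespaces"),
        endpoint(base_url, "namespaces", namespace, "tables"),
    )
    for url in urls:
        print(url)
        print(json.dumps(get_json(url), indent=2))
```

Calling `show_catalog()` with the server running prints the raw JSON behind each listing.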
In browser,
The Code
Iceberg REST Catalog using HiveCatalog backed by Hive Metastore #43
Dev log: REST Catalog on HMS
Next Steps
This post shows the minimal viable components needed to set up an Iceberg REST Catalog server using the HiveCatalog implementation and backed by the Hive Metastore.
Several changes to the components would make this example more production-like.
The Hive Metastore currently runs with embedded Derby as its database and the local filesystem for storage. Derby can only process one connection from the metastore at a time. Future work can replace the backing database with a more durable one such as Postgres or MySQL. Storage should also be replaced with an S3-compatible layer such as MinIO.
I’d also like to integrate the Iceberg REST Catalog with the rest of the data ecosystem, especially Spark and Trino, and test reading and writing tables through the REST catalog with various processing engines.
Disclaimer
I am not very familiar with Docker and Gradle, so if there’s a better way to do this, please do let me know.