What an interesting time to be in the data infra space…
In the last few weeks:
- Tabular (tabular.io) was acquired by Databricks (https://www.databricks.com/blog/databricks-tabular)
- Snowflake announced Polaris, an open-sourced Iceberg REST catalog implementation (https://www.snowflake.com/blog/introducing-polaris-catalog/)
- Databricks open-sourced Unity Catalog, which also supports the Iceberg REST spec (https://www.databricks.com/blog/open-sourcing-unity-catalog)
- Nessie will integrate with the Iceberg REST spec (https://projectnessie.org/blog/2024/05/13/nessie-integration-of-iceberg-rest/)
- Even Hive is joining the party by providing an Iceberg REST catalog interface (https://github.com/apache/hive/pull/5145)
The catalog layer has been top of mind for me in the last few years. Managing the Hive Metastore at scale to support Hive and Iceberg for multi-engine environments has been challenging and rewarding at the same time.
I believe the Iceberg REST catalog is a data-infra game-changer and will inspire the next generation of innovation. If you’re thinking about using the Iceberg REST catalog, check out this previous article on why it can be a good bet for your organization (4 Reasons to Choose the Iceberg REST Catalog).
The Open Sourced Iceberg REST Catalog Spec
The beauty of the Iceberg REST Catalog spec is that it’s publicly available for all. It’s here in the Iceberg repo, https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml. And the OpenAPI UI documentation is hosted here.
Anyone can develop a compliant service and plug into the ecosystem of tools that support Iceberg REST.
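As a concrete example, the spec pins down the exact JSON shape of every response. A `GET /v1/namespaces` handler returns a `ListNamespacesResponse`, in which each namespace is a list of name parts. Here is a minimal sketch of that shape (not the full schema):

```python
import json

# Sketch of the ListNamespacesResponse shape from the Iceberg REST spec.
# Each namespace is a list of name parts, so a multi-level namespace
# like "prod.nyc" would appear as ["prod", "nyc"].
response = {"namespaces": [["nyc_taxi"]]}

body = json.dumps(response)
print(body)
```

Any service that speaks this wire format, regardless of implementation language, looks the same to an Iceberg client.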
Another Iceberg REST Catalog Implementation
The Pythonic Iceberg REST Catalog service is something I’ve been working on. It’s based on the Iceberg REST spec and the PyIceberg library.
It is open-sourced in this repo (https://github.com/kevinjqliu/iceberg-rest-catalog). The goal is to provide an open-source reference implementation, written in Python, so that the community can quickly prototype with the REST catalog.
I announced this last week on LinkedIn. https://www.linkedin.com/posts/kevinjqliu_github-kevinjqliuiceberg-rest-catalog-activity-7204199613711417346-R7_2/
Production Service
What’s better than a reference implementation? A production service endpoint!
The REST Catalog service is deployed using Modal and is publicly accessible.
Here’s the URL
<redacted>/v1/namespaces
Here are some URLs for interacting with the REST endpoint:
List all namespaces
<redacted>/v1/namespaces
Show a specific namespace, `nyc_taxi`
<redacted>/v1/namespaces/nyc_taxi
List all tables in the namespace
<redacted>/v1/namespaces/nyc_taxi/tables
Show a specific table
<redacted>/v1/namespaces/nyc_taxi/tables/yellow_tripdata
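These endpoints can be scripted too. The sketch below builds the same paths in Python and fetches them with the standard library; `BASE` is a stand-in for the redacted service URL above:

```python
import json
from urllib.request import urlopen

# Placeholder for the (redacted) service URL above.
BASE = "https://example.com"

def endpoint(*parts: str) -> str:
    """Build a v1 catalog URL, e.g. endpoint("namespaces", "nyc_taxi")."""
    return "/".join([BASE, "v1", *parts])

def fetch(*parts: str) -> dict:
    """GET a catalog endpoint and decode the JSON response."""
    with urlopen(endpoint(*parts)) as resp:
        return json.load(resp)

# fetch("namespaces")                        -> list all namespaces
# fetch("namespaces", "nyc_taxi", "tables")  -> list tables in nyc_taxi
```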
Integrations
PyIceberg
```
pyiceberg --uri <redacted> list
```
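Beyond the CLI, the same catalog can be used from the PyIceberg Python API via `load_catalog`. A minimal sketch, where the URI is a placeholder for the redacted endpoint and the import is deferred so the snippet stands alone:

```python
# Placeholder properties; swap the uri for the real service endpoint.
catalog_props = {"type": "rest", "uri": "https://example.com"}

def list_nyc_taxi_tables():
    """Connect to the REST catalog and list tables in the nyc_taxi namespace."""
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("rest", **catalog_props)
    # Returns table identifiers such as ("nyc_taxi", "yellow_tripdata")
    return catalog.list_tables("nyc_taxi")
```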
Trino
Add an Iceberg connector
https://trino.io/docs/current/object-storage/metastores.html#iceberg-rest-catalog
```
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=<redacted>
```
Spark
Pyspark configs
https://iceberg.apache.org/docs/1.5.0/spark-configuration/#catalogs
```
from pyspark.sql import SparkSession

# iceberg_jar_path and aws_sdk_s3_jar_path point to the Iceberg Spark
# runtime jar and the AWS bundle jar on the local filesystem.
spark = (
    SparkSession.builder.appName("IcebergExample")
    .config("spark.jars", f"{iceberg_jar_path},{aws_sdk_s3_jar_path}")
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    .config("spark.sql.catalog.rest", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.rest.type", "rest")
    .config("spark.sql.catalog.rest.uri", "<redacted>")
    .config("spark.sql.catalog.rest.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.rest.warehouse", "s3://warehouse/rest/")
    .config("spark.sql.catalog.rest.s3.endpoint", "http://127.0.0.1:9000")
    .config("spark.sql.defaultCatalog", "rest")
    .config("spark.sql.catalogImplementation", "in-memory")
    .getOrCreate()
)
```
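Once the session is up, the catalog's tables can be queried with plain SQL. A small sketch, assuming the `spark` session configured above:

```python
def preview_trips(spark):
    """Read a few rows of nyc_taxi.yellow_tripdata through the REST catalog."""
    df = spark.sql("SELECT * FROM nyc_taxi.yellow_tripdata LIMIT 5")
    df.show()
    return df.count()
```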
About the data inside
The table `nyc_taxi.yellow_tripdata` is registered with an S3 file from the NYC TLC dataset, an Open Data initiative (https://aws.amazon.com/marketplace/pp/prodview-okyonroqg5b2u#overview).
The entire catalog is independent of the underlying data. The catalog service is deployed on Modal, and the metadata is saved in a publicly accessible S3 bucket (s3://iceberg-rest-catalog/).
Fin
If you’re thinking about Iceberg, REST catalog, Open Table Format (OTF), or anything related, feel free to reach out to me. I would love to chat.
…
P.S. If you’re in the Seattle area, we’re having our second Seattle Iceberg Meetup on June 25th! More details at https://sites.google.com/view/icebergmeetup.