What an interesting time to be in the data infra space…
In the last few weeks:
- Tabular (tabular.io) was acquired by Databricks (https://www.databricks.com/blog/databricks-tabular)
- Snowflake announced Polaris, an open-sourced Iceberg REST catalog implementation (https://www.snowflake.com/blog/introducing-polaris-catalog/)
- Databricks open-sourced Unity Catalog, which also supports the Iceberg REST spec (https://www.databricks.com/blog/open-sourcing-unity-catalog)
- Nessie will integrate with the Iceberg REST spec (https://projectnessie.org/blog/2024/05/13/nessie-integration-of-iceberg-rest/)
- Even Hive is joining the party by providing an Iceberg REST catalog interface (https://github.com/apache/hive/pull/5145)
The catalog layer has been top of mind for me in the last few years. Managing the Hive Metastore at scale to support Hive and Iceberg for multi-engine environments has been challenging and rewarding at the same time.
I believe the Iceberg REST catalog is a data-infra game-changer and will inspire the next generation of innovation. If you’re thinking about using the Iceberg REST catalog, check out this previous article on why it can be a good bet for your organization (4 Reasons to Choose the Iceberg REST Catalog).
The Open Sourced Iceberg REST Catalog Spec
The beauty of the Iceberg REST Catalog spec is that it’s publicly available for all. It’s here in the Iceberg repo, https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml. And the OpenAPI UI documentation is hosted here.
Anyone can develop a compliant service and plug into the ecosystem of tools that support Iceberg REST.
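As a concrete example, the spec pins down the exact JSON shape of every response. A `GET /v1/namespaces` handler returns a `ListNamespacesResponse`, in which each namespace is a list of name parts. Here is a minimal sketch of that shape (not the full schema):

```python
import json

# Sketch of the ListNamespacesResponse shape from the Iceberg REST spec.
# Each namespace is a list of name parts, so a multi-level namespace
# like "prod.nyc" would appear as ["prod", "nyc"].
response = {"namespaces": [["nyc_taxi"]]}

body = json.dumps(response)
print(body)
```

Any service that speaks this wire format, regardless of implementation language, looks the same to an Iceberg client.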
Another Iceberg REST Catalog Implementation
The Pythonic Iceberg REST Catalog service is something I’ve been working on. It’s based on the Iceberg REST spec and the PyIceberg library.
It is open-sourced in this repo (https://github.com/kevinjqliu/iceberg-rest-catalog). The goal is to provide an open-source reference implementation, written in Python, so that the community can quickly prototype with the REST catalog.
I announced this last week on LinkedIn. https://www.linkedin.com/posts/kevinjqliu_github-kevinjqliuiceberg-rest-catalog-activity-7204199613711417346-R7_2/
Production Service
What’s better than a reference implementation? A production service endpoint!
The REST Catalog service is deployed using Modal and is publicly accessible.
Here’s the URL
<redacted>/v1/namespaces
Here are some URLs for interacting with the REST endpoint:
List all namespaces
<redacted>/v1/namespaces
Show a specific namespace, `nyc_taxi`
<redacted>/v1/namespaces/nyc_taxi
List all tables in the namespace
<redacted>/v1/namespaces/nyc_taxi/tables
Show a specific table
<redacted>/v1/namespaces/nyc_taxi/tables/yellow_tripdata
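These endpoints can be scripted too. The sketch below builds the same paths in Python and fetches them with the standard library; `BASE` is a stand-in for the redacted service URL above:

```python
import json
from urllib.request import urlopen

# Placeholder for the (redacted) service URL above.
BASE = "https://example.com"

def endpoint(*parts: str) -> str:
    """Build a v1 catalog URL, e.g. endpoint("namespaces", "nyc_taxi")."""
    return "/".join([BASE, "v1", *parts])

def fetch(*parts: str) -> dict:
    """GET a catalog endpoint and decode the JSON response."""
    with urlopen(endpoint(*parts)) as resp:
        return json.load(resp)

# fetch("namespaces")                        -> list all namespaces
# fetch("namespaces", "nyc_taxi", "tables")  -> list tables in nyc_taxi
```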
Integrations
PyIceberg
```
pyiceberg --uri <redacted> list
```
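Beyond the CLI, the same catalog can be used from the PyIceberg Python API via `load_catalog`. A minimal sketch, where the URI is a placeholder for the redacted endpoint and the import is deferred so the snippet stands alone:

```python
# Placeholder properties; swap the uri for the real service endpoint.
catalog_props = {"type": "rest", "uri": "https://example.com"}

def list_nyc_taxi_tables():
    """Connect to the REST catalog and list tables in the nyc_taxi namespace."""
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("rest", **catalog_props)
    # Returns table identifiers such as ("nyc_taxi", "yellow_tripdata")
    return catalog.list_tables("nyc_taxi")
```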
Trino
Add an Iceberg connector
https://trino.io/docs/current/object-storage/metastores.html#iceberg-rest-catalog
```
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=<redacted>
```
Spark
Pyspark configs
https://iceberg.apache.org/docs/1.5.0/spark-configuration/#catalogs
```
from pyspark.sql import SparkSession

# iceberg_jar_path and aws_sdk_s3_jar_path point to the Iceberg Spark
# runtime jar and the AWS bundle jar on the local filesystem.
spark = (
    SparkSession.builder.appName("IcebergExample")
    .config("spark.jars", f"{iceberg_jar_path},{aws_sdk_s3_jar_path}")
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    .config("spark.sql.catalog.rest", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.rest.type", "rest")
    .config("spark.sql.catalog.rest.uri", "<redacted>")
    .config("spark.sql.catalog.rest.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.rest.warehouse", "s3://warehouse/rest/")
    .config("spark.sql.catalog.rest.s3.endpoint", "http://127.0.0.1:9000")
    .config("spark.sql.defaultCatalog", "rest")
    .config("spark.sql.catalogImplementation", "in-memory")
    .getOrCreate()
)
```
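Once the session is up, the catalog's tables can be queried with plain SQL. A small sketch, assuming the `spark` session configured above:

```python
def preview_trips(spark):
    """Read a few rows of nyc_taxi.yellow_tripdata through the REST catalog."""
    df = spark.sql("SELECT * FROM nyc_taxi.yellow_tripdata LIMIT 5")
    df.show()
    return df.count()
```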
About the data inside
The table `nyc_taxi.yellow_tripdata` is registered with an S3 file from the NYC TLC dataset, an Open Data initiative (https://aws.amazon.com/marketplace/pp/prodview-okyonroqg5b2u#overview).
The entire catalog is independent of the underlying data. The catalog service is deployed on Modal, and the metadata is saved in a publicly accessible S3 bucket (s3://iceberg-rest-catalog/).
Fin
If you’re thinking about Iceberg, REST catalog, Open Table Format (OTF), or anything related, feel free to reach out to me. I would love to chat.
…
P.S. If you’re in the Seattle area, we’re having our second Seattle Iceberg Meetup on June 25th! More details at https://sites.google.com/view/icebergmeetup.