4 Reasons to Choose the Iceberg REST Catalog
Why the REST Catalog can be a great choice for your organization
The previous post, Iceberg REST Catalog with Hive Metastore, showed how to set up the Iceberg REST Catalog backed by the Hive Metastore. In this post, we’ll dive deep into the motivation behind this architecture and go over some of the pros and cons of this approach.
Overview
As organizations adopt Apache Iceberg into their data stack, picking the right catalog implementation should become top of mind. The REST Catalog deployment is especially important given recent developments in the Open Table Format space.
So why should organizations adopt the REST Catalog in their data infrastructure?
This article will cover these 4 reasons to use the REST Catalog:
ease of setup, either as a new deployment or on top of an existing catalog deployment
extendability, with enhancements such as scaling and telemetry
ability to decouple infrastructure by providing a layer of indirection
ability to avoid vendor lock-in
On Iceberg and the Catalogs
Apache Iceberg is an open table format for analytical (OLAP) use cases. It was created to supersede the Hive table format, which was widely used for big data workloads. Iceberg defines a specification for storing data and is interoperable with many different compute engines. Some of the most common compute engines, such as Spark, Trino, Flink, Hive, and Dremio, support reading and writing Iceberg tables. Support from other engines and vendors, such as DuckDB, Snowflake, and Databricks, is in progress as well. Iceberg’s interoperability with all the different engines and vendors exemplifies its “open” ethos and breaks the cycle of vendor lock-in. With Iceberg, organizations can write once and read everywhere.
Iceberg is opinionated about its deployment, requiring a catalog to keep track of tables and metadata. To use the Iceberg format, compute engines communicate with the catalog to read and write tables. As the Iceberg Catalogs documentation notes, compute engines that use a shared Iceberg catalog can operate on a common data layer. The Iceberg catalog is therefore the central piece of infrastructure that allows all compute engines to operate on the same tables.
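To make this concrete, here is a minimal sketch of pointing a compute engine at a catalog. The catalog name, metastore address, and table are placeholders, and it assumes the Iceberg Spark runtime jar is already on the classpath.

```python
from pyspark.sql import SparkSession

# Sketch: a Spark session configured with an Iceberg catalog named "lake",
# backed here by a hypothetical Hive Metastore at thrift://metastore:9083.
# Assumes the iceberg-spark-runtime jar is on the Spark classpath.
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hive")
    .config("spark.sql.catalog.lake.uri", "thrift://metastore:9083")
    .getOrCreate()
)

# Reads and writes resolve table metadata through the catalog.
spark.sql("SELECT * FROM lake.db.events LIMIT 10").show()
```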
The Iceberg project has many catalog implementations available to choose from. Some of the most common catalog implementations are Hive Metastore, Glue, DynamoDB, JDBC, Hadoop, Nessie, and REST. Vendors also provide different catalog implementations, such as AWS Glue Catalog, Google BigLake Metastore, Tabular REST Catalog, Snowflake Iceberg Catalog, and Databricks Unity Catalog. With so many catalog implementations, choosing a catalog seems daunting and can depend on many factors such as familiarity, existing infrastructure, and ecosystem.
Organizations with existing Hive infrastructure may choose to use the Hive Metastore as their Iceberg Catalog. Those in the AWS ecosystem can use the Glue catalog which has native integration with AWS compute engines like Athena and EMR. The same goes for those running on top of GCP, Azure/Databricks, or Snowflake.
Most vendors are integrating Iceberg across the stack within their ecosystem. So why should organizations pick the REST catalog instead?
Reasons to pick the REST Catalog
Easy to Set Up
REST Catalog is easy to set up, either as a new deployment or on top of an existing catalog deployment.
The REST Catalog implementation follows a client/server architecture. Implementation details live on the catalog server, and clients follow the REST specification to communicate with it over HTTP.
The REST catalog can be quickly spun up as an HTTP wrapper around the underlying catalog. Its main purpose is to translate HTTP requests into Iceberg operations for the underlying catalog. The REST catalog specification can be found in the public Iceberg repo (link). Tabular provides a reference REST Catalog implementation in this public repo (link). See Iceberg REST Catalog with Hive Metastore as an example of setting up the REST Catalog wrapped around the Hive Metastore catalog.
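To give a feel for the client side, here is a minimal sketch using PyIceberg; the endpoint and table identifier are placeholders for whatever your REST Catalog deployment exposes.

```python
from pyiceberg.catalog import load_catalog

# Placeholder endpoint: a REST Catalog server (e.g. the reference implementation)
# running locally, wrapping an underlying catalog such as the Hive Metastore.
catalog = load_catalog(
    "rest",
    **{
        "type": "rest",
        "uri": "http://localhost:8181",
    },
)

# The client only speaks the REST spec; it doesn't know what backs the server.
print(catalog.list_namespaces())
table = catalog.load_table("db.events")  # placeholder table identifier
print(table.schema())
```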
Alternatively, companies can go the vendor route and use the Tabular-provided REST Catalog.
Extendability
REST Catalog can be extended with enhancements such as scaling and telemetry.
With the REST Catalog implementation, organizations are free to implement features on top of the ones supported by the REST specification. One such feature is telemetry for Iceberg usage: API calls to the REST Catalog can be recorded for usage analysis and auditability, which comes in handy for understanding usage patterns and for debugging. The Tabular cookbook lists other possible enhancements, such as server-side commit deconfliction and retries, multi-table commits, and credential vending.
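As a rough illustration (not a production design), a thin proxy in front of the REST Catalog could log every request before forwarding it. The catalog address and the framework choice here are assumptions.

```python
import logging

import httpx
from fastapi import FastAPI, Request, Response

CATALOG_URL = "http://localhost:8181"  # assumed address of the underlying REST Catalog

app = FastAPI()
log = logging.getLogger("catalog-telemetry")

@app.api_route("/{path:path}", methods=["GET", "POST", "DELETE", "HEAD"])
async def proxy(path: str, request: Request) -> Response:
    # Telemetry hook: record who called which catalog route.
    client_host = request.client.host if request.client else "unknown"
    log.info("%s /%s from %s", request.method, path, client_host)

    # Forward the request to the real catalog.
    # (A real proxy would also forward auth headers and handle errors.)
    async with httpx.AsyncClient() as upstream:
        resp = await upstream.request(
            request.method,
            f"{CATALOG_URL}/{path}",
            content=await request.body(),
            params=dict(request.query_params),
        )
    return Response(content=resp.content, status_code=resp.status_code,
                    media_type=resp.headers.get("content-type"))
```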
REST Catalog can be extended to provide horizontal scaling. Using an API gateway, HTTP requests can be routed to many instances of the underlying catalog. This is useful to prevent noisy neighbors and limit the blast radius from any single catalog instance failure.
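To make the routing idea concrete, here is a toy sketch of what an API gateway could do: hash the namespace from the request onto one of several catalog instances. The instance addresses are made up.

```python
import hashlib

# Hypothetical pool of REST Catalog instances sitting behind the gateway.
BACKENDS = [
    "http://catalog-0:8181",
    "http://catalog-1:8181",
    "http://catalog-2:8181",
]

def pick_backend(namespace: str) -> str:
    """Deterministically map a namespace to one catalog instance, so load is
    spread out and a single instance failure only affects its namespaces."""
    digest = hashlib.sha256(namespace.encode("utf-8")).digest()
    return BACKENDS[digest[0] % len(BACKENDS)]

print(pick_backend("analytics"))  # the same namespace always routes to the same instance
```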
Decouple Infrastructure
REST Catalog helps decouple infrastructure by providing a layer of indirection.
REST Catalog’s client/server architecture and HTTP protocol can be used to decouple implementation details from both clients and catalogs.
HTTP is language agnostic. The Iceberg library, and many of its catalog implementations, are primarily written in Java. Although other language implementations are available, such as Python and Rust, Java is predominantly chosen for Iceberg deployments. By communicating over HTTP, the REST Catalog can be used from any programming language that supports HTTP.
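Because the interface is plain HTTP, any language with an HTTP client can talk to the catalog directly. As a sketch, assuming an unauthenticated catalog at localhost:8181, listing namespaces is a single GET against the spec's namespaces route:

```python
import requests

# Assumes a REST Catalog with no auth running at http://localhost:8181.
resp = requests.get("http://localhost:8181/v1/namespaces")
resp.raise_for_status()

# The spec returns namespaces as lists of name parts,
# e.g. {"namespaces": [["db"], ["analytics"]]}.
print(resp.json())
```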
REST Catalog also decouples catalog implementations from compute engines. Traditionally, compute engines need to support many catalog implementations to cover all the bases. For example, Trino supports the Hive Metastore, Glue, JDBC, Nessie, and REST catalog implementations (link). Instead of reimplementing the catalog logic for each specific implementation, Trino can just use the REST implementation as the common interface. All the other catalog implementations can be wrapped with the REST Catalog to speak the same protocol. New engines should consider this approach instead of supporting a multitude of catalog implementations.
Avoid vendor lock-in
REST Catalog can help avoid vendor lock-in.
Some catalog implementations tend to favor a particular vendor. The REST Catalog can be used to level the playing field; it decouples the catalog from the underlying implementation. With the REST Catalog, features and behaviors are standardized by the REST Catalog specification, which provides a set of APIs as the common interface. This encourages all vendors to support a single standard interface instead of many divergent catalog implementations.
The REST Catalog also works with any catalog implementation by “wrapping” the underlying catalog with the REST specification and protocol. Organizations can “Bring Your Own Catalog,” wrap it in the REST protocol, and become interoperable with any compute engine that also speaks the REST protocol. This composability gives organizations flexibility when picking a catalog.
Reasons to not choose the REST Catalog (yet)
As described above, the REST Catalog provides several benefits for any Iceberg deployment. However, it may not be necessary in certain situations.
Organizations may choose one of the “out of the box” catalogs. For example, the Glue catalog can be spun up with a few clicks. The REST Catalog, on the other hand, adds an extra layer of complexity compared to other catalogs. Until it is needed to solve one of the problems above, the catalogs that are easier to deploy can provide more value immediately.
The extra complexity of the REST Catalog comes from running it as a service. Data teams must adhere to the rigor and responsibility of running a production service. This is especially true for a catalog service, which acts as the central nervous system for the entire data infrastructure. Downtime in the catalog service can disrupt the entire business, since data can't be read or written.
As with any software, premature optimization is the root of all evil. Keep the benefits of the REST Catalog in mind, and defer adopting it until it provides clear value.
Fin
Apache Iceberg has been an exciting development in the big data ecosystem. I hope this post sheds some light on the REST Catalog and the value it provides.
I believe that the open table format needs an open catalog implementation. The REST Catalog specification can help achieve that goal.
…
If you’re working on Iceberg, open table formats, or anything else in the data ecosystem, drop me a line. I would love to chat about it. I also plan to write more on Iceberg, so look out for more posts from this Substack!
And lastly, the Iceberg Summit is on May 14-15. It is free and available online. Register here.