What is Open Table Format

What’s a table format?

A good way to define a table format is a way to organize a dataset’s files to present them as a single “table”. The primary goal of a table format is to provide the abstraction of a table to people and tools and allow them to efficiently interact with that table’s underlying data.

Table formats are nothing new — they’ve been around since System R, Multics, and Oracle first implemented Edgar Codd’s relational model, although “table format” wasn’t the term used at the time. All interaction with the underlying data, like writing or reading it, was handled by the database’s storage engine. No other engine could interact with the files directly without corrupting the system.

Apache Hive is one of the earliest and most used table formats in big data world, it stores the metadata information(like schema) of the datasets in HMS(Hive Metastore, backed by a central realtional DB), and provides SQL programming model to query the datasets as a table, the Hive table format has been the de facto standard ever since, but it has the following drawbacks when used at larger scale:

What is open table format?

To address those problems, communities started creating new open table formats:

Though organized in different layouts, they all provide Lakehouse features:

References:

[1] Apache Iceberg: An Architectural Look Under the Covers
[2] Open Table Formats — Delta, Iceberg & Hudi