
Add Apache Iceberg format support #8148

Open

frankliee wants to merge 1 commit into huggingface:main from frankliee:iceberg

Conversation


frankliee commented Apr 23, 2026

Add Apache Iceberg format support

Motivation

Apache Iceberg is the most widely adopted open table format for data lakes, supported by Databricks,
Snowflake, AWS Glue, Dremio, and others. A large amount of ML training data lives in Iceberg tables.
Currently, users must manually export Iceberg data to Parquet before loading it into Hugging Face
Datasets; this PR removes that friction.

Fixes #7863.

Usage

Users pass a pre-configured pyiceberg Catalog object and a table identifier:

from pyiceberg.catalog.sql import SqlCatalog
from datasets import load_dataset

catalog = SqlCatalog("my_catalog", uri="sqlite:///catalog.db", warehouse="/tmp/warehouse")

Basic loading

ds = load_dataset("iceberg", catalog=catalog, table="db.my_table")

Column selection + row filtering (predicate pushdown)

ds = load_dataset("iceberg", catalog=catalog, table="db.my_table",
columns=["text", "label"],
filters=[("label", ">", 0)])

Multiple splits from different tables

ds = load_dataset("iceberg", catalog=catalog,
table={"train": "db.train", "test": "db.test"})

Time travel via snapshot_id

ds = load_dataset("iceberg", catalog=catalog, table="db.my_table",
snapshot_id=7051729674881785648)

Streaming

ds = load_dataset("iceberg", catalog=catalog, table="db.my_table", streaming=True)

Works with any pyiceberg-supported catalog backend (REST, Hive, Glue, SQL, etc.) — the builder is agnostic
to how the catalog is configured.
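
For illustration, a REST-backed catalog could be built the same way with pyiceberg's load_catalog
(the catalog name is arbitrary, and the URI and token below are placeholders):

from pyiceberg.catalog import load_catalog
from datasets import load_dataset

# Only the catalog construction changes; the builder call stays the same.
catalog = load_catalog(
    "my_rest_catalog",
    **{
        "type": "rest",
        "uri": "https://rest-catalog.example.com",
        "token": "<token>",
    },
)
ds = load_dataset("iceberg", catalog=catalog, table="db.my_table")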

Design decisions

  • Catalog object passed in, not constructed internally. Iceberg catalog configuration varies widely across
    backends (REST, Hive, Glue, SQL each have different auth/connection params). Rather than re-implementing a
    "catalog factory" inside the builder, users bring their own catalog — similar to how the sql builder accepts
    an existing SQLAlchemy connection. This keeps the builder simple and forward-compatible with new catalog
    types.
  • No _EXTENSION_TO_MODULE registration. Unlike file-based formats (Parquet, Lance, CSV), Iceberg tables are
    addressed via catalog + table identifier, not file extensions. Users must specify "iceberg" explicitly as
    the path argument.
  • create_config_id override for fingerprinting. Catalog objects (containing SQLAlchemy engines, connection
    pools, etc.) are not picklable by dill. The override replaces the catalog with a stable string
    representation ("{ClassName}_{name}") before hashing (see the first sketch after this list).
  • _CountableBuilderMixin for fast row counting. Uses scan.plan_files() metadata to count rows without
    reading data files (see the second sketch after this list).
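
A rough sketch of the fingerprinting override described above (illustrative only; the class name and
exact details may differ from the PR's code):

import copy
import datasets

class IcebergConfig(datasets.BuilderConfig):
    def create_config_id(self, config_kwargs, custom_features=None):
        # Swap the unpicklable catalog for a stable "{ClassName}_{name}"
        # string before datasets hashes the kwargs for the config id.
        config_kwargs = copy.copy(config_kwargs)
        catalog = config_kwargs.get("catalog")
        if catalog is not None:
            config_kwargs["catalog"] = f"{type(catalog).__name__}_{catalog.name}"
        return super().create_config_id(config_kwargs, custom_features=custom_features)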
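
And a sketch of the metadata-based row count (also illustrative; note that raw record_count values
do not account for Iceberg delete files):

def _count_rows(scan):
    # Sum per-file record counts from table metadata without
    # opening the data files themselves.
    return sum(task.file.record_count for task in scan.plan_files())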
lhoestq (Member) left a comment


Awesome! I just have one comment:

splits.append(
    datasets.SplitGenerator(
        name=split_name,
        gen_kwargs={"scan": scan},

Do you think we can have a list here instead? This would enable parallel processing/streaming,

e.g. one scan object per file, maybe.
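
A rough sketch of that shape, assuming one pyiceberg FileScanTask per data file (the scan_tasks
kwarg name is hypothetical):

splits.append(
    datasets.SplitGenerator(
        name=split_name,
        # One entry per data file would allow sharded/parallel reads.
        gen_kwargs={"scan_tasks": list(scan.plan_files())},
    )
)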

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
