
Add Apache Iceberg format support #8148

Open

frankliee wants to merge 1 commit into huggingface:main from frankliee:iceberg

Conversation


frankliee commented Apr 23, 2026

Add Apache Iceberg format support

Motivation

Apache Iceberg is the most widely adopted open table format for data lakes, supported by Databricks,
Snowflake, AWS Glue, Dremio, and others. A large amount of ML training data lives in Iceberg tables.
Currently, users must manually export Iceberg data to Parquet before loading it into Hugging Face
Datasets; this PR removes that friction.

Fixes #7863.

Usage

Users pass a pre-configured pyiceberg Catalog object and a table identifier:

from pyiceberg.catalog.sql import SqlCatalog
from datasets import load_dataset

catalog = SqlCatalog("my_catalog", uri="sqlite:///catalog.db", warehouse="/tmp/warehouse")

Basic loading

ds = load_dataset("iceberg", catalog=catalog, table="db.my_table")

Column selection + row filtering (predicate pushdown)

ds = load_dataset("iceberg", catalog=catalog, table="db.my_table",
columns=["text", "label"],
filters=[("label", ">", 0)])

Multiple splits from different tables

ds = load_dataset("iceberg", catalog=catalog,
table={"train": "db.train", "test": "db.test"})

Time travel via snapshot_id

ds = load_dataset("iceberg", catalog=catalog, table="db.my_table",
snapshot_id=7051729674881785648)

Streaming

ds = load_dataset("iceberg", catalog=catalog, table="db.my_table", streaming=True)

Works with any pyiceberg-supported catalog backend (REST, Hive, Glue, SQL, etc.) — the builder is agnostic
to how the catalog is configured.
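
For illustration, a REST-backed catalog could be built the same way with pyiceberg's load_catalog
(the catalog name is arbitrary, and the URI and token below are placeholders):

from pyiceberg.catalog import load_catalog
from datasets import load_dataset

# Only the catalog construction changes; the builder call stays the same.
catalog = load_catalog(
    "my_rest_catalog",
    **{
        "type": "rest",
        "uri": "https://rest-catalog.example.com",
        "token": "<token>",
    },
)
ds = load_dataset("iceberg", catalog=catalog, table="db.my_table")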

Design decisions

  • Catalog object passed in, not constructed internally. Iceberg catalog configuration varies widely across
    backends (REST, Hive, Glue, SQL each have different auth/connection params). Rather than re-implementing a
    "catalog factory" inside the builder, users bring their own catalog — similar to how the sql builder accepts
    an existing SQLAlchemy connection. This keeps the builder simple and forward-compatible with new catalog
    types.
  • No _EXTENSION_TO_MODULE registration. Unlike file-based formats (Parquet, Lance, CSV), Iceberg tables are
    addressed via catalog + table identifier, not file extensions. Users must specify "iceberg" explicitly as
    the path argument.
  • create_config_id override for fingerprinting. Catalog objects (containing SQLAlchemy engines, connection
    pools, etc.) are not picklable by dill. The override replaces the catalog with a stable string
    representation ("{ClassName}_{name}") before hashing (see the first sketch after this list).
  • _CountableBuilderMixin for fast row counting. Uses scan.plan_files() metadata to count rows without
    reading data files (see the second sketch after this list).
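
A rough sketch of the fingerprinting override described above (illustrative only; the class name and
exact details may differ from the PR's code):

import copy
import datasets

class IcebergConfig(datasets.BuilderConfig):
    def create_config_id(self, config_kwargs, custom_features=None):
        # Swap the unpicklable catalog for a stable "{ClassName}_{name}"
        # string before datasets hashes the kwargs for the config id.
        config_kwargs = copy.copy(config_kwargs)
        catalog = config_kwargs.get("catalog")
        if catalog is not None:
            config_kwargs["catalog"] = f"{type(catalog).__name__}_{catalog.name}"
        return super().create_config_id(config_kwargs, custom_features=custom_features)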
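
And a sketch of the metadata-based row count (also illustrative; note that raw record_count values
do not account for Iceberg delete files):

def _count_rows(scan):
    # Sum per-file record counts from table metadata without
    # opening the data files themselves.
    return sum(task.file.record_count for task in scan.plan_files())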
lhoestq (Member) left a comment


Awesome! I just have one comment:

splits.append(
    datasets.SplitGenerator(
        name=split_name,
        gen_kwargs={"scan": scan},

Do you think we can have a list here instead? This would enable parallel processing/streaming,

e.g. one scan object per file, maybe.
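
A rough sketch of that shape, assuming one pyiceberg FileScanTask per data file (the scan_tasks
kwarg name is hypothetical):

splits.append(
    datasets.SplitGenerator(
        name=split_name,
        # One entry per data file would allow sharded/parallel reads.
        gen_kwargs={"scan_tasks": list(scan.plan_files())},
    )
)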

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
