fix(arrow_dataset): initialise writer on first non-None map result, not on i==0 #8054

Open
s-zx wants to merge 1 commit into huggingface:main from s-zx:fix/7990-map-writer-init-on-first-nonnull
s-zx commented Mar 8, 2026

Summary

Dataset.map crashes with AttributeError: 'NoneType' object has no attribute 'write' when the map function returns None for the first N examples and a dict for later ones (issue #7990).

Root Cause

The ArrowWriter was initialised with a guard tied to the example index:

# Non-batched path (line ~3736)
if i == 0:
    buf_writer, writer, tmp_file = init_buffer_and_writer()

# Batched path (line ~3762)
if i and i[0] == 0:
    buf_writer, writer, tmp_file = init_buffer_and_writer()

update_data is set lazily via prepare_outputs(). When the first example returns None, update_data is set to False and no writer is created. When a later example (for instance i=2) returns a dict, update_data flips to True, but because i != 0 the writer initialisation is skipped. The subsequent writer.write(example) call then fails because writer is still None.
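
The failure mode can be modelled in isolation with a toy sketch. This is not the code in arrow_dataset.py: ToyWriter, map_with_index_guard and skip_first_two are hypothetical names, and the isinstance check is a simplified stand-in for what prepare_outputs() decides.

```python
from typing import Callable, Dict, List, Optional


class ToyWriter:
    """Hypothetical stand-in for ArrowWriter; it only collects rows in a list."""

    def __init__(self) -> None:
        self.rows: List[Dict] = []

    def write(self, example: Dict) -> None:
        self.rows.append(example)


def map_with_index_guard(
    examples: List[Dict], function: Callable[[Dict], Optional[Dict]]
) -> List[Dict]:
    """Toy non-batched loop that mirrors the index-based guard."""
    writer: Optional[ToyWriter] = None
    for i, example in enumerate(examples):
        result = function(example)
        # Assumption: a dict result means "there is data to write".
        update_data = isinstance(result, dict)
        if update_data:
            if i == 0:  # the writer can only ever be created at index 0
                writer = ToyWriter()
            writer.write(result)  # fails when the first dict arrives at i > 0
    return writer.rows if writer is not None else []


def skip_first_two(example: Dict) -> Optional[Dict]:
    """Return None for x in {0, 1} and a transformed dict otherwise."""
    return None if example["x"] < 2 else {"x": example["x"] * 10}


try:
    map_with_index_guard([{"x": n} for n in range(4)], skip_first_two)
except AttributeError as err:
    print(f"AttributeError: {err}")
```

In this model the writer is never created because index 0 produced no output, so the first write at index 2 raises the same kind of AttributeError as in the issue.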

Fix

Replace the index-based guards with if writer is None:

# Non-batched path
if writer is None:
    buf_writer, writer, tmp_file = init_buffer_and_writer()

# Batched path
if writer is None:
    buf_writer, writer, tmp_file = init_buffer_and_writer()

The writer is now created on the first example/batch that actually produces output, regardless of its position.
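
As a rough illustration of why the writer is None guard is position-independent, here is a toy sketch of the same loop with lazy initialisation. ToyWriter, map_with_lazy_writer and skip_first_two are hypothetical names for illustration only; the real Dataset.map does considerably more.

```python
from typing import Callable, Dict, List, Optional


class ToyWriter:
    """Hypothetical stand-in for ArrowWriter; it only collects rows in a list."""

    def __init__(self) -> None:
        self.rows: List[Dict] = []

    def write(self, example: Dict) -> None:
        self.rows.append(example)


def map_with_lazy_writer(
    examples: List[Dict], function: Callable[[Dict], Optional[Dict]]
) -> List[Dict]:
    """Toy non-batched loop using the `writer is None` guard."""
    writer: Optional[ToyWriter] = None
    for example in examples:
        result = function(example)
        if isinstance(result, dict):
            if writer is None:  # created on the first result that has output
                writer = ToyWriter()
            writer.write(result)
    return writer.rows if writer is not None else []


def skip_first_two(example: Dict) -> Optional[Dict]:
    """Return None for x in {0, 1} and a transformed dict otherwise."""
    return None if example["x"] < 2 else {"x": example["x"] * 10}


print(map_with_lazy_writer([{"x": n} for n in range(4)], skip_first_two))
# prints [{'x': 20}, {'x': 30}]
```

Because the guard depends on the writer's state rather than on the index, the same check works unchanged for a batched loop, where the first batch with output may not start at index 0.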

Tests

Added test_map_writer_initialized_when_first_examples_return_none to tests/test_arrow_dataset.py: applies a map function that returns None for indices 0–1 and a transformed dict for the rest, verifying the call completes without error and produces correct output.

Fixes #7990

…ot on i==0

`Dataset.map` initialised the ArrowWriter only when `i == 0` for the
non-batched path and `i[0] == 0` for the batched path.  When the map
function returns `None` for the first few examples, `update_data` stays
`False` and no writer is created.  As soon as a later example returns a
dict, `update_data` becomes `True`, but the writer is still `None`,
causing:

    AttributeError: 'NoneType' object has no attribute 'write'

Fix: replace `if i == 0` / `if i and i[0] == 0` with `if writer is None`
so the writer is initialised on the first non-None output regardless of
the example index.

Add regression test `test_map_writer_initialized_when_first_examples_return_none`
that applies a map function returning None for the first two examples
and a dict for the rest, asserting it completes without error.

Fixes huggingface#7990