Skip to content

fix: tolerate chmod PermissionError on flat-permission filesystems#8146

Open
dschulmeist wants to merge 1 commit intohuggingface:mainfrom
dschulmeist:fix/chmod-best-effort-flat-filesystems-8125
Open

fix: tolerate chmod PermissionError on flat-permission filesystems#8146
dschulmeist wants to merge 1 commit intohuggingface:mainfrom
dschulmeist:fix/chmod-best-effort-flat-filesystems-8125

Conversation

@dschulmeist
Copy link
Copy Markdown

Summary

Fixes #8125.

Dataset.map() (and the indices writer in Dataset.flatten_indices / Dataset.select path) calls os.chmod(cache_file, 0o666 & ~umask) on the freshly moved cache file. On flat-permission filesystems — GCS FUSE, S3 FUSE mounts, and similar object-storage-backed mounts — chmod raises PermissionError (a subclass of OSError). The cache file itself is already written successfully at that point, but the unguarded chmod takes down the whole .map() call with a traceback like:

  File "src/datasets/arrow_dataset.py", line 3747, in _map_single
    os.chmod(cache_file_name, 0o666 & ~umask)
PermissionError: [Errno 1] Operation not permitted: '/mnt/data/cache_55b1844ebd9dffff_00000_of_00001.arrow'

Fix

Wrap each of the two os.chmod call sites in src/datasets/arrow_dataset.py with try/except OSError, logging the failure at debug level. The permission update stays best-effort: sane 0o666 & ~umask permissions on POSIX filesystems that support it, silent graceful degradation on flat-permission mounts where there's nothing to set.

No new API surface, no env-var toggle — the issue body suggested an opt-out env var, but catching OSError is narrower and doesn't require users to know about a new knob. If a future caller actually needs to detect the degraded mode, they can observe the debug log.

Tests

New regression test test_map_survives_chmod_permission_error in tests/test_arrow_dataset.py:

  • Patches datasets.arrow_dataset.os.chmod to raise PermissionError for paths inside the pytest tmp_path (so pytest's own housekeeping is unaffected)
  • Runs Dataset.from_dict({\"a\": [1, 2, 3]}).map(lambda x: {\"b\": x[\"a\"] * 2}, cache_file_name=...)
  • Asserts the mapped values are correct ([2, 4, 6]), the cache file exists on disk, and the chmod guard was actually exercised

Verified the test fails on main without the fix (propagates PermissionError) and passes with it.

Scope notes

  • Two chmod sites changed in arrow_dataset.py — both follow the identical shutil.move → umask → chmod pattern
  • No behavior change on POSIX filesystems that accept chmod
  • Issue author asked about an env-var opt-out; this goes narrower (catch-the-error) because the chmod is genuinely best-effort and failure is recoverable

Checklist

  • Added a regression test in tests/
  • ruff check and ruff format --check clean on touched files
  • Related test_map_caching* tests still pass
  • Scope isolated to one bug
…uggingface#8125)

map() and the indices cache writer both call os.chmod on the freshly
moved cache file. On flat-permission filesystems (GCS FUSE, S3 FUSE
mounts) chmod raises PermissionError / OSError, aborting the whole
map() call even though the cache file itself was written successfully.

Wrap each chmod in try/except OSError with a debug log. The chmod
remains best-effort: it still sets sane 0o666-masked permissions on
filesystems that support it, and silently degrades on ones that don't.

Adds a regression test that patches os.chmod to raise PermissionError
only for paths inside the pytest tmp-path and asserts that Dataset.map
still returns the correct mapped values and the cache file ends up on
disk.

Fixes huggingface#8125
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

1 participant