protect binfmt_misc from cross-distro wipe at shutdown #40621
Pull request overview
This PR addresses a WSL2/systemd shutdown regression where systemd-shutdown can clear the kernel-global binfmt_misc registry (by writing -1 to /proc/sys/fs/binfmt_misc/status), breaking interop in other concurrently running distros. The fix hardens each distro’s mount namespace by bind-mounting a read-only lock file over /proc/sys/fs/binfmt_misc/status, and updates WSLInterop registration to use the F (fix-binary) flag, with new unit tests covering both the mechanism and an end-to-end cross-distro scenario.
Changes:
- Add `LockBinfmtStatusReadOnly()` to bind-mount a read-only file over `/proc/sys/fs/binfmt_misc/status` (per distro mount namespace) to block registry wipes at shutdown.
- Register `WSLInterop` with the `F` flag in VM paths to keep the interpreter resolved across mount namespaces.
- Add/replace Windows unit tests validating the lock behavior and regression coverage across distro termination.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| test/windows/UnitTests.cpp | Replaces prior systemd/binfmt coverage with two tests: mechanism validation and cross-distro regression scenario. |
| src/linux/init/main.cpp | Switches WSLInterop registration string to a VM-specific macro that includes the F flag. |
| src/linux/init/init.cpp | Removes prior systemd service override generation and introduces LockBinfmtStatusReadOnly() invoked during systemd boot. |
| src/linux/init/binfmt.h | Adds a VM-specific interop registration macro using F (fix-binary) flag and documents flag behavior. |
Windows interop in every running WSL2 distro silently breaks whenever a
sibling systemd-enabled distro shuts down, surfacing to users as:
/bin/bash: line 1: /mnt/c/Windows/system32/cmd.exe:
cannot execute binary file: Exec format error
Root cause: `systemd-shutdown` calls `disable_binfmt()` during clean
shutdown, which writes `-1` to `/proc/sys/fs/binfmt_misc/status`.
binfmt_misc is a single kernel-global registry shared across the WSL VM
(distros do not isolate it via a user namespace), so that one write wipes
every entry -- including WSLInterop -- for every running distro.
Fix: each per-distro init bind-mounts a read-only file over
`/proc/sys/fs/binfmt_misc/status` in its own mount namespace before
exec'ing the distro's init. systemd-shutdown's wipe write then fails with
EROFS; systemd logs a warning and continues normally (its
`binfmt_mounted_and_writable()` helper deliberately tolerates this
case). Per-entry unregister (`echo -1 > .../<name>`) and runtime
registration (`echo ... > .../register`) target different files and are
unaffected, so callers retain full control over their own binfmt entries.
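The two-step mount described above can be sketched roughly as follows. This is a minimal illustration, not the PR's `LockBinfmtStatusReadOnly()`: the helper name `BindMountReadOnly` and its return convention are hypothetical, and the detach-on-failure step follows the PR description. It assumes a Linux target and a caller with mount privileges.

```cpp
#include <sys/mount.h>
#include <cerrno>

// Hypothetical helper (not the PR's code): bind-mounts `lockFile` over
// `target`, then remounts that bind-mount read-only.
// Returns 0 on success, or the errno of the step that failed.
int BindMountReadOnly(const char* lockFile, const char* target)
{
    // Step 1: shadow the target with the lock file in this mount namespace.
    if (mount(lockFile, target, nullptr, MS_BIND, nullptr) < 0)
    {
        return errno;
    }

    // Step 2: a bind-mount is made read-only by a separate remount call.
    if (mount(nullptr, target, nullptr, MS_BIND | MS_REMOUNT | MS_RDONLY, nullptr) < 0)
    {
        const int error = errno;
        // Detach the writable bind-mount so no writable shadow is left behind.
        umount2(target, MNT_DETACH);
        return error;
    }

    return 0;
}
```

With the paths named in this PR, the call would look like `BindMountReadOnly("/run/wsl/binfmt-status-lock", "/proc/sys/fs/binfmt_misc/status")`. The sketch leaves out the idempotency checks and the creation of the lock file.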
`LockBinfmtStatusReadOnly` is idempotent: it bails early if binfmt_misc
isn't mounted, no-ops if `/status` already resolves to our lock file,
and recovers from a stale foreign mount via `umount2(MNT_DETACH)`
followed by a retry. The existing `[boot] protectBinfmt` wsl.conf key
(default true) now controls the bind-mount and acts as a kill switch for
users who want to manage binfmt_misc themselves.
WSLInterop is also re-registered from mini_init with the `F`
(fix-binary) flag so the interpreter is opened at registration time and
remains valid across mount namespaces.
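For orientation, a binfmt_misc registration line has the form `:name:type:offset:magic:mask:interpreter:flags`, and `F` goes in the final field. The sketch below assembles such a line for a magic-based entry. The helper name is hypothetical, and the `MZ` magic and `/init` interpreter path used in the usage note are assumptions for illustration; the actual macro in `binfmt.h` may differ and may carry additional flags.

```cpp
#include <string>

// Hypothetical helper (not the macro from binfmt.h): assembles a
// binfmt_misc registration line for a magic-based ('M') entry with an
// empty offset and an empty mask.
std::string BuildBinfmtRegistration(
    const std::string& name,
    const std::string& magic,
    const std::string& interpreter,
    const std::string& flags)
{
    // Layout: :name:type:offset:magic:mask:interpreter:flags
    return ":" + name + ":M::" + magic + "::" + interpreter + ":" + flags;
}
```

Under those assumptions, `BuildBinfmtRegistration("WSLInterop", "MZ", "/init", "F")` yields `:WSLInterop:M::MZ::/init:F`, which is the kind of string written to `/proc/sys/fs/binfmt_misc/register`.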
Tests:
* `BinfmtStatusIsLocked` -- mechanism test: `/status` is its own
mountpoint, writes fail with EROFS, WSLInterop survives the wipe
attempt, /register and per-entry unregister still work, and the
`protectBinfmt=false` kill switch removes the bind-mount.
* `BinfmtSurvivesDistroTermination` -- end-to-end regression test:
imports a systemd-enabled peer distro, terminates it, and asserts
that the primary distro's Windows interop still works.
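The "writes fail with EROFS" check in the first test can be pictured as a small write probe. This is a hedged sketch rather than the test's actual code (the unit tests run on the Windows side); the helper name `TryWrite` is hypothetical.

```cpp
#include <cerrno>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical probe: attempts to write `data` to an existing file.
// Returns 0 on success, or the errno reported by open(2) or write(2).
int TryWrite(const char* path, const char* data)
{
    const int fd = open(path, O_WRONLY);
    if (fd < 0)
    {
        // On a read-only mount, opening for writing is expected to fail
        // here with EROFS.
        return errno;
    }

    const ssize_t written = write(fd, data, strlen(data));
    const int error = (written < 0) ? errno : 0;
    close(fd);
    return error;
}
```

With the lock in place, `TryWrite("/proc/sys/fs/binfmt_misc/status", "-1")` would be expected to return `EROFS`. Without the lock, the same call performs the real wipe, so it should only be run where that is acceptable.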
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
OneBlue
left a comment
LGTM. This is the 3rd iteration of us trying to resolve this issue. Hopefully at some point we'll have kernel support to namespace the binfmt_misc interpreters, but until then I think this is the best we'll be able to do.
One small caveat of this approach: if a distro has an explicit systemd mount over binfmt_misc, it would override the read-only mount, but AFAIK distros don't do that.
Summary of the Pull Request
Fixes the cross-distro `binfmt_misc` wipe at systemd shutdown. When any systemd-enabled WSL2 distro terminates, `systemd-shutdown`'s `disable_binfmt()` writes `-1` to `/proc/sys/fs/binfmt_misc/status`, which clears the entire kernel-global registry. Because WSL distros share that registry, every other running distro loses WSLInterop and Windows interop breaks with `cannot execute binary file: Exec format error`.
This change bind-mounts a read-only file over `/proc/sys/fs/binfmt_misc/status` in each per-distro mount namespace so the wipe write fails with `EROFS` and `systemd-shutdown` continues normally. Per-entry registration/unregistration is unaffected.
PR Checklist
- `BinfmtStatusIsLocked` (mechanism) and `BinfmtSurvivesDistroTermination` (end-to-end regression)
- (`[boot] protectBinfmt` key is unchanged from a user's perspective)
Detailed Description of the Pull Request / Additional comments
Root cause. `binfmt_misc` is a single kernel-global registry shared across the WSL VM (WSL distros do not isolate it via a user namespace). `systemd-shutdown` calls `disable_binfmt()` during clean shutdown, which writes `-1` to `/proc/sys/fs/binfmt_misc/status`; that one write clears every entry in the registry, including WSLInterop, for every running distro.
Fix. Each per-distro init bind-mounts `/run/wsl/binfmt-status-lock` (a regular file containing `enabled\n`) over `/proc/sys/fs/binfmt_misc/status`, then remounts the bind-mount read-only. The mount lives in the per-distro mount namespace and is inherited by systemd. When systemd-shutdown later runs, the write to `/status` fails with `EROFS`; systemd's `binfmt_mounted_and_writable()` helper deliberately tolerates this case, so systemd-shutdown logs a warning and continues normally. Reads of `/status` still return `enabled\n`, so callers that probe for binfmt_misc availability keep working. Per-entry unregister (`echo -1 > /proc/sys/fs/binfmt_misc/<name>`) and runtime registration (`echo ... > /proc/sys/fs/binfmt_misc/register`) target different files and are unaffected.
If the read-only remount fails, the writable bind-mount is detached so we don't leave a writable shadow over the real `/status`.
The existing `[boot] protectBinfmt` wsl.conf key (default `true`) now controls the bind-mount and remains as a kill switch for users who want to manage `binfmt_misc` themselves.
WSLInterop is also re-registered from `mini_init` with the `F` (fix-binary) flag so the kernel opens the interpreter at registration time and the entry remains valid across mount namespaces.
What this does not change. A distro can still override its own WSLInterop entry locally (e.g., via `/usr/lib/binfmt.d/dummy.conf`). The fix only prevents one distro from wiping the registry for everyone else.
Validation Steps Performed
Built locally on Windows and deployed `init` + `initrd.img` to `C:\Program Files\WSL\tools\`. Ran each test individually with the deployed bits:
- `BinfmtStatusIsLocked`: passes. Verifies `/status` is a mountpoint, `echo -1 > /status` fails with `EROFS`, `WSLInterop` survives the wipe attempt, `/register` and per-entry unregister still work, interop still functions, and the `protectBinfmt=false` kill switch removes the bind-mount.
- `BinfmtSurvivesDistroTermination`: passes. Imports a systemd-enabled peer distro, terminates it (triggering systemd shutdown), asserts the primary distro's `cmd.exe` interop still works and the `WSLInterop` entry retains the `F` flag.
- `Interop`, `Systemd*` (System, User, Disabled, NoClearTmpUnit, KillInitTerminatesDistro), `InitReadonly`, `InitPermissions`, `WslConfWarnings`: all pass.
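The "`/status` is a mountpoint" check can be approximated by scanning `/proc/self/mountinfo`, where the fifth whitespace-separated field of each line is the mount point. How the actual test performs this check is not stated here, so the following is only one possible sketch; the helper name `IsMountPoint` is hypothetical.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: returns true if `path` appears as a mount point in
// text formatted like /proc/self/mountinfo.
// Fields per line: mount ID, parent ID, major:minor, root, mount point, ...
bool IsMountPoint(const std::string& mountInfo, const std::string& path)
{
    std::istringstream lines(mountInfo);
    std::string line;
    while (std::getline(lines, line))
    {
        std::istringstream fields(line);
        std::string mountId, parentId, device, root, mountPoint;
        if ((fields >> mountId >> parentId >> device >> root >> mountPoint) && mountPoint == path)
        {
            return true;
        }
    }

    return false;
}
```

A caller would read the contents of `/proc/self/mountinfo` into a string and pass `/proc/sys/fs/binfmt_misc/status` as `path`. Taking the text as a parameter keeps the parsing testable without mount privileges.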