
Conversation

@Vinayyyy7 commented Dec 31, 2025

This PR fixes multi-GPU training for vision models when using device_map="auto" or device_map="balanced".

When running on multiple GPUs, setting device_map="auto" or device_map="balanced" with the FastVisionModel class splits the model across devices. During training, this leaves hidden_states on one GPU (e.g., cuda:1) while lm_head sits on another (e.g., cuda:0). The fused cross-entropy loss then computes gradients on the lm_head device, but PyTorch expects them back on the original hidden_states device, causing a RuntimeError.

Errors we see:

Unsupported: NotImplementedError/UnsupportedFakeTensorException when running FX node
  Explanation: Dynamo failed to run FX node with fake tensors: call_function <function _autograd_grad at 0x7adc2d2d8180>(*((GradTrackingTensor(lvl=1, value=
        FakeTensor(..., device='cuda:0', size=())
    ),), [GradTrackingTensor(lvl=1, value=
        FakeTensor(..., device='cuda:1', size=(s97, 2048), dtype=torch.float16,
                   requires_grad=True)
    )]), **{'create_graph': True}): got NotImplementedError('Cannot access storage of TensorWrapper')
  Hint: If the op is a PyTorch op, please file an issue to PyTorch.

  Developer debug context: 

 For more details about this graph break, please visit: https://meta-pytorch.github.io/compile-graph-break-site/gb/gb0087.html

from user code:
   File "/usr/local/lib/python3.11/dist-packages/unsloth_zoo/fused_losses/cross_entropy_loss.py", line 276, in accumulate_chunk
    (chunk_loss, (unscaled_loss,)) = torch.func.grad_and_value(
  File "/usr/local/lib/python3.11/dist-packages/torch/_functorch/apis.py", line 449, in wrapper
    return eager_transforms.grad_and_value_impl(
  File "/usr/local/lib/python3.11/dist-packages/torch/_functorch/vmap.py", line 47, in fn
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/_functorch/eager_transforms.py", line 1390, in grad_and_value_impl
    flat_grad_input = _autograd_grad(

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"

OR

[rank1]: ValueError: You can't train a model that has been loaded in 8-bit or 4-bit precision on a different device than the one you're training on. Make sure you loaded the model on the correct device using for example `device_map={'':torch.cuda.current_device()}` or `device_map={'':torch.xpu.current_device()}`

This fix adds distributed-training detection to FastModel.from_pretrained() and FastBaseModel.from_pretrained(). When distributed training is detected and device_map is set to "auto" or "balanced", the device map is overridden to data-parallel mode, where each GPU loads a full copy of the model.
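
Rough sketch of what that detection could look like (is_distributed and prepare_device_map are the loader_utils helpers this PR adds; the bodies below are illustrative assumptions, not the actual implementation):

import os
import torch

def is_distributed() -> bool:
    # Assumed logic: torch.distributed is already initialised, or a launcher
    # (torchrun / accelerate) exported WORLD_SIZE > 1 for this process.
    if torch.distributed.is_available() and torch.distributed.is_initialized():
        return True
    return int(os.environ.get("WORLD_SIZE", "1")) > 1

def prepare_device_map():
    # Assumed behaviour: pin the whole model onto this rank's GPU, so training
    # runs data-parallel with no layer splitting across devices.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    return {"": local_rank}, local_rank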

Note: This PR works together with a corresponding fix in unsloth-zoo that handles the gradient device mismatch in the fused CE loss.

unslothai/unsloth-zoo#423
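
The gist of that companion fix, as an illustrative sketch only (not the actual cross_entropy_loss.py code): track the device the hidden-state chunk came from, compute the chunked loss and gradient on the lm_head device, then move the gradient back.

import torch

def chunked_loss_and_grad(hidden_chunk, lm_head_weight, loss_fn):
    # Remember where the activations live before anything is moved.
    original_device = hidden_chunk.device
    # The logits matmul must run on the device holding lm_head's weight.
    chunk = hidden_chunk.detach().to(lm_head_weight.device).requires_grad_(True)
    loss = loss_fn(chunk @ lm_head_weight.t())
    (grad,) = torch.autograd.grad(loss, chunk)
    # Move the gradient back to the device autograd expects for hidden_chunk.
    return loss, grad.to(original_device)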

Tested on Kaggle with 2x T4 GPUs using Qwen/Qwen3-VL-2B-Instruct.
Result: training ran successfully.

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 2
   \\   /|    Num examples = 10,000 | Num Epochs = 1 | Total steps = 1,250
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 139,460,608 of 2,266,992,640 (6.15% trained)
Step | Training Loss
-- | --
1 | 46.795200
2 | 109.610000
3 | 40.122600
4 | 44.093400
5 | 32.455600
6 | 47.650400
7 | 60.632800
8 | 24.140800
9 | 63.971800
10 | 19.096700
11 | 20.428400
12 | 17.989900
13 | 10.218600
14 | 20.911900
15 | 16.450800
16 | 6.303300
17 | 4.037200
18 | 4.878000
19 | 2.447200
20 | 2.050200
@gemini-code-assist gemini-code-assist bot commented

Summary of Changes

Hello @Vinayyyy7, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves critical issues preventing successful multi-GPU training for vision models when using automatic or balanced device mapping. By intelligently detecting distributed training environments, the system now defaults to a data-parallel strategy, ensuring each GPU receives a complete model copy. This prevents model splitting across devices, which previously led to gradient calculation errors and RuntimeError exceptions, thereby enabling robust and efficient multi-GPU training for vision models.

Highlights

  • Multi-GPU Training Fix: Addresses RuntimeError and ValueError encountered during multi-GPU training of vision models when device_map is set to "auto" or "balanced", which caused model splitting and gradient device mismatches.
  • Device Map Override: Implements a mechanism to detect distributed training and, if device_map is "auto", "balanced", or "balanced_low_0", automatically switches to a data-parallel approach where each GPU loads a full model copy, preventing splitting issues.
  • Warning for Device Map: Adds a warning to inform users when the device_map is overridden for distributed vision model training, explaining why data-parallel mode is being used instead of device splitting.
  • Integration with unsloth-zoo: Notes that this PR works in conjunction with a corresponding fix in unsloth-zoo to handle gradient device mismatches in the fused cross-entropy loss.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a critical issue with multi-GPU training for vision models by correctly handling device_map in distributed environments. The fix prevents model splitting across GPUs, which was causing runtime errors. The overall approach is sound. However, the implementation introduces code duplication between FastModel.from_pretrained and FastBaseModel.from_pretrained. My review includes suggestions to refactor this by removing the redundant code and centralizing the logic in the base model, which will improve maintainability and efficiency.

Comment on lines +685 to +701
# Fix for multi-GPU distributed training
# When using distributed training (e.g., 2x T4 on Kaggle), device_map="auto"/"balanced"
# splits the model across GPUs which can cause gradient device mismatch errors.
# Instead, use data-parallel approach where each GPU gets a full model copy.
from .loader_utils import prepare_device_map, is_distributed

if is_distributed():
    prepared_device_map, _ = prepare_device_map()
    if device_map in ("auto", "balanced", "balanced_low_0"):
        import warnings

        warnings.warn(
            f"Unsloth: Multi-GPU device splitting (device_map='{device_map}') is not supported "
            f"for vision/multimodal models in distributed training. "
            f"Using device_map={prepared_device_map} instead (data-parallel mode). "
            f"Each GPU will load a full copy of the model.",
            stacklevel = 2,

gemini-code-assist bot (high):

This logic for handling device_map in distributed training is duplicated from FastBaseModel.from_pretrained in unsloth/models/vision.py. Since FastModel.from_pretrained calls FastBaseModel.from_pretrained later in its execution, this check is performed twice. This is inefficient and creates a maintenance burden. To adhere to the DRY (Don't Repeat Yourself) principle, this entire block should be removed. The logic in FastBaseModel will handle this for all relevant models, especially after generalizing its warning message as suggested in the other comment.

if device_map in ("auto", "balanced", "balanced_low_0"):
    import warnings

    warnings.warn(

gemini-code-assist bot (medium):

To make this warning message more general and suitable for all models that use this base class, consider changing 'vision models' to 'vision/multimodal models'. This will allow you to remove the duplicated logic from FastModel.from_pretrained while keeping the warning accurate for all inheriting classes.

Suggested change:

warnings.warn(
    f"for vision/multimodal models in distributed training. "

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +414 to +418
f"Using device_map={prepared_device_map} instead (data-parallel mode). "
f"Each GPU will load a full copy of the model.",
stacklevel = 2,
)
device_map = prepared_device_map


P1: Avoid overriding device_map during distributed inference

The new distributed guard now unconditionally replaces device_map with the per-rank map whenever is_distributed() is true. That code runs for any distributed launch (including inference with torchrun/accelerate), so runs that relied on device_map="auto"/"balanced" to shard a large vision model across GPUs now get forced into data-parallel replication on a single device, likely causing OOM or losing sharding entirely. Before this change, distributed inference respected the user-supplied sharded map. Consider limiting this override to training-only flows or providing an opt-out so multi-GPU inference continues to work.


@Vinayyyy7 (Author) commented

Your concern: this fix might break distributed inference that relies on device_map="auto"/"balanced" for model sharding.

Why this is unlikely to be an issue:

  1. Typical inference workflow in Unsloth notebooks:
     After training, users typically run inference on a single GPU within the same notebook:

     FastVisionModel.for_inference(model)  # Switch to inference mode
     inputs = tokenizer(image, text, ...).to("cuda")  # Single GPU
     model.generate(**inputs, max_new_tokens=128)

     This is single-process inference and is NOT affected by this fix; the is_distributed() check returns False in this case.

  2. Production inference with large models:
     Users typically use vLLM (fast_inference=True) or plain transformers, as the Unsloth docs suggest, rather than sharding with device_map="balanced" under torchrun or accelerate train.py scripts.

The actual use case this fixes:

Suppose a user like me runs multi-GPU training on Kaggle with 2x T4:

torchrun --nproc_per_node=2 train.py

Without this fix, the model gets split across GPUs, causing gradient device mismatch errors. The fix enables data-parallel training where each GPU gets a full model copy.

How this was tested:

I tested this fix on Kaggle with 2x T4 GPUs by running code directly in notebook cells - no torchrun or accelerate launch commands. Just regular notebook execution with device_map="auto":

from unsloth import FastVisionModel
model, tokenizer = FastVisionModel.from_pretrained(
    "Qwen/Qwen3-VL-2B-Instruct",
    load_in_4bit=False,  # 16-bit
    device_map="auto",
)

Training proceeds normally on 2 GPUs:

trainer.train()  # works now

Consistent with existing behavior:

Text generation models using FastLanguageModel already work fine with device_map="auto"/"balanced" on multi-GPU setups without requiring torchrun or accelerate commands. This fix brings the same capability to vision models using FastVisionModel, which previously crashed.

In a normal Colab/Kaggle notebook environment (which is how most Unsloth users run training), the fix in cross_entropy_loss.py handles the gradient device mismatch directly.

Inference is unaffected:

After training, users run inference on a single GPU:

FastVisionModel.for_inference(model)
inputs = tokenizer(image, text, ...).to("cuda")
model.generate(**inputs)  # Single GPU inference

This is the standard workflow shown in Unsloth's official notebooks and remains completely unaffected.
That said, I'd appreciate an actual human review from a maintainer for anything further, thanks.

@Datta0 (Collaborator) commented Jan 1, 2026

Hey @Vinayyyy7, thanks for the contribution!
I'm not 100% sure that switching to DataParallel is what users would want when they set device_map=balanced.
So maybe we should error out instead of warning, so that they can retry with an appropriate device_map? We could say something like:

"device_map balanced/auto is not supported for FastVisionModel; we'd recommend using Data Parallel by setting device_map=xyz. Note that this uses more memory per GPU than balanced/auto, but is equivalent to running single-GPU training."

thoughts @danielhanchen ?

@Vinayyyy7 (Author) commented

Agreed that erroring out is cleaner than silently overriding; users should know explicitly what's happening.

Clarifications:

Splitting the model across GPUs always crashed due to gradient device mismatch errors. The companion PR in unsloth-zoo (unslothai/unsloth-zoo#423) handles this by tracking the original device and moving gradients back. But even with that fix, model splitting has other issues with distributed training and quantization.

What device_map we can suggest to users:
For data-parallel training, where each GPU loads a full model copy (and which actually works), users should use one of the following (see the sketch after this list):

  • device_map=None  # for multi-GPU
  • device_map={"": "cuda:0"}  # explicitly, for single GPU
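
A hedged usage sketch of those two options (the model name is just the one from my test; adjust as needed):

from unsloth import FastVisionModel

# Multi-GPU data-parallel training: each process keeps a full model copy.
model, tokenizer = FastVisionModel.from_pretrained(
    "Qwen/Qwen3-VL-2B-Instruct",
    device_map = None,
)

# Single-GPU training: pin everything to one device explicitly.
# model, tokenizer = FastVisionModel.from_pretrained(
#     "Qwen/Qwen3-VL-2B-Instruct",
#     device_map = {"": "cuda:0"},
# )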

Proposed error message:

raise ValueError(
    f"Unsloth: device_map='{device_map}' is not supported for FastVisionModel in multi-GPU training. "
    f"Model splitting across GPUs causes gradient device mismatch errors. "
    f"For multi-GPU training, remove device_map or set device_map=None to use data-parallel mode "
    f"where each GPU loads a full copy of the model. "
    f"Note: This uses more VRAM per GPU but provides equivalent training to single GPU."
)

Is this error plus the device_map=None suggestion a good approach? That way, if users have enough VRAM, they can train roughly 2x faster on 2 GPUs instead of leaving one idle. Most users might prefer Kaggle since it provides 12 hours of continuous GPU time, while Colab offers only 4-5 hours and a single GPU.

@Datta0

@Datta0 (Collaborator) commented Jan 1, 2026

Hey @Vinayyyy7,
Yeah, the error looks fine. Though I too am not 100% sure what device_map=None ends up doing. I can't seem to readily find it either.

@Vinayyyy7 (Author) commented

It's just an if condition: we warn users that auto/balanced is not supported for FastVisionModel (like the example error above) and tell them they can use data parallel via device_map=None. That way they know it requires more VRAM but will work, and is faster than training on 1 GPU.

Note: right now this is not implemented in the PR; currently it just switches to data parallel directly when it detects auto/balanced.
