Add Idefics3 support (Granite Docling VLM) #4090
gaztrabisme wants to merge 3 commits into unslothai:main
Closes unslothai#4079

- Add "idefics3" to VLLM_SUPPORTED_VLM
- Add _fix_requires_grad_hooks_for_kwargs() to handle models whose forward() passes all arguments via kwargs (an empty positional-args tuple causes requires_grad_pre_hook to raise RuntimeError)

Tested with ibm-granite/granite-docling-258M (258M params):

- SFT on unsloth/LaTeX_OCR: loss 3.1 → 1.3, 30 steps, 3.7GB VRAM
- GRPO with TRL GRPOTrainer: full RL loop verified
- LoRA/DoRA + gradient checkpointing + generation all pass
Summary of Changes

Hello @gaztrabisme, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces support for Idefics3 Vision-Language Models (VLMs) within the framework, specifically enabling fine-tuning of models like ibm-granite/granite-docling-258M.
for more information, see https://pre-commit.ci
Code Review
This pull request adds support for the Idefics3 architecture (e.g., Granite Docling VLM) to FastVisionModel. It includes Idefics3 in the supported VLM list and introduces a fix for requires_grad pre-hooks to handle models that pass all arguments via keyword arguments, which results in empty positional argument tuples. The fix is implemented via _fix_requires_grad_hooks_for_kwargs, which replaces problematic hooks with a safer version. The implementation is generally sound, but there are opportunities to improve the robustness of the hook logic and remove redundant code.
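The kwargs-only failure mode the review describes can be reproduced in a few lines. This is a minimal sketch; `Encoder` is a hypothetical stand-in for Idefics3's vision tower, not the actual class:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    # Stand-in for a vision encoder whose callers pass everything by keyword.
    def forward(self, pixel_values=None):
        return pixel_values * 2

seen = {}

def pre_hook(module, args):
    # A plain forward_pre_hook only sees positional arguments;
    # keyword arguments are invisible here.
    seen["args"] = args

enc = Encoder()
enc.register_forward_pre_hook(pre_hook)
enc(pixel_values=torch.ones(2, 2))  # kwargs-only call
print(seen["args"])  # () — so any hook indexing args[0] fails
```

With `seen["args"]` empty, a hook that unconditionally touches `input[0]` raises, which is exactly the `RuntimeError` this PR works around.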
unsloth/models/vision.py
Outdated
```python
def _safe_requires_grad_pre_hook(module, input):
    type_input = type(input)
    if type_input is torch.Tensor:
        input.requires_grad_(True)
    elif type_input is tuple or type_input is list:
        if len(input) == 0:
            return
        if torch.is_floating_point(input[0]):
            input[0].requires_grad_(True)
```
The _safe_requires_grad_pre_hook implementation can be made more robust. Specifically, torch.is_floating_point(input[0]) will raise a TypeError if input[0] is not a torch.Tensor (e.g., if it is None or a raw integer). Additionally, the check for type_input is torch.Tensor is likely redundant as input in a module's forward_pre_hook is typically a tuple, but it's good to handle both cases safely. The pass at line 156 is also unnecessary.
```python
def _safe_requires_grad_pre_hook(module, input):
    if isinstance(input, (tuple, list)):
        if len(input) > 0 and isinstance(input[0], torch.Tensor) and torch.is_floating_point(input[0]):
            input[0].requires_grad_(True)
    elif isinstance(input, torch.Tensor) and torch.is_floating_point(input):
        input.requires_grad_(True)
```
The type(x) is Y pattern and trailing pass match the existing codebase style (used throughout vision.py and unsloth_zoo). The upstream requires_grad_pre_hook also calls torch.is_floating_point(input[0]) without an isinstance guard — input in _forward_pre_hooks is always a tuple of tensors.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 96e4d87889
unsloth/models/vision.py
Outdated
```python
if len(input) == 0:
    return
```
Mark kwargs tensors for checkpointed forwards
On single-process training paths post_patch_model keeps use_reentrant=True, so checkpointed layers still require at least one grad-enabled input tensor to propagate parameter gradients. This hook now returns immediately when positional input is empty, but kwargs-only layers (the exact Idefics3 case) then never get any tensor switched to requires_grad=True; the crash is avoided, but LoRA parameters inside those checkpointed blocks can silently receive no gradients. Please handle kwargs tensors (or force non-reentrant checkpointing here) instead of early-returning on empty positional args.
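Codex's concern can be checked directly: under reentrant checkpointing, a block whose inputs all have `requires_grad=False` produces an output detached from autograd, so its parameters receive no gradients. A minimal sketch, using a plain `Linear` as the checkpointed block:

```python
import torch
from torch.utils.checkpoint import checkpoint

lin = torch.nn.Linear(4, 4)

# No input requires grad: with use_reentrant=True the checkpointed
# output is detached from autograd, so backward never reaches lin's weights.
x = torch.randn(2, 4)
out = checkpoint(lin, x, use_reentrant=True)
print(out.requires_grad)  # False

# Once some input requires grad (what requires_grad_pre_hook arranges),
# parameter gradients flow through the recomputed forward again.
x2 = torch.randn(2, 4, requires_grad=True)
checkpoint(lin, x2, use_reentrant=True).sum().backward()
print(lin.weight.grad is not None)  # True
```

This is why the hook exists at all: some tensor entering each checkpointed region must carry `requires_grad=True`.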
This does not apply here. For Idefics3, requires_grad_for_gradient_checkpointing finds model.text_model has get_input_embeddings and registers a post hook on embed_tokens — the requires_grad_pre_hook fallback is never used:

```
PRE hooks with requires_grad: 0
POST hooks: embed_tokens -> requires_grad_post_hook
```

Our replacement is defensive safety for models that might hit the fallback path with a kwargs-only forward. The actual gradient flow is unaffected — confirmed by SFT loss 3.11 → 1.29 and GRPO grad_norm 0.66 → 1.45.
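The post-hook path described above can be sketched with a frozen embedding (as under LoRA, where base weights are frozen). The hook name mirrors the reply, but the module is a toy stand-in, not the real embed_tokens:

```python
import torch
import torch.nn as nn

# Toy stand-in for model.text_model.embed_tokens with frozen base weights.
# Without the hook, its output would not require grad.
embed = nn.Embedding(10, 8)
embed.weight.requires_grad_(False)

def requires_grad_post_hook(module, args, output):
    # Mark the embedding output so downstream checkpointed blocks
    # receive a grad-enabled tensor to backpropagate through.
    output.requires_grad_(True)

embed.register_forward_hook(requires_grad_post_hook)

out = embed(torch.tensor([[1, 2, 3]]))
print(out.requires_grad)  # True
```

Since every transformer layer consumes this output, gradients reach the LoRA parameters regardless of how the layers receive their arguments.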
I think @mmathew23 handled something like this, where args passed via kwargs cause trouble with gradients. @gaztrabisme, can you check if you can find that and use the same approach, so we have a consistent fix rather than a duplicate?
@Datta0 Thanks for the pointer. I looked through mmathew23's contributions — the closest match handles args passed via kwargs inside the layer forward. Our issue is at a different layer: the upstream requires_grad_pre_hook itself crashes on an empty positional-args tuple:

```python
if len(input) == 0:
    raise RuntimeError("Unsloth: Failed to make input require gradients!")
    # print(f" WARNING: Empty list input to {module.__class__.__name__}!")
    # return
```

Our fix replaces that hook post-registration with a safe variant that returns early instead of raising. Is there a specific commit or PR I might have missed? Happy to align if there's an existing pattern for this.
Yeah, I think you already found the right commit. I'd let @mmathew23 comment more on this :)
@gaztrabisme do you have sample SFT and GRPO scripts to test this? It might be best to make a PR to update https://github.com/unslothai/unsloth-zoo/blob/main/unsloth_zoo/peft_utils.py#L162-L325, rather than redefine most of the hook in vision.py.
Thanks @mmathew23! Good suggestion — fixing it at the source in unsloth_zoo/peft_utils.py makes more sense than redefining the hook in vision.py.

Test scripts (SFT + GRPO) are here: https://gist.github.com/gaztrabisme/d5c5f405146e679eae67b6ec1ee2e9eb

SFT — 30 steps on unsloth/LaTeX_OCR, loss 3.1 → 1.3. GRPO — 20 steps with TRL GRPOTrainer. Both use ibm-granite/granite-docling-258M. I can put together a PR to unsloth-zoo.
danielhanchen left a comment
Tested on 1x NVIDIA B200 (CUDA 12.8, torch 2.9.1, transformers 4.57.6, vLLM 0.15.1) with 7 vision notebooks:
| Notebook | Result |
|---|---|
| Qwen3_VL_8B_Vision | PASS |
| Qwen3_VL_8B_Vision_GRPO | PASS |
| Gemma3_4B_Vision | PASS |
| Gemma3_4B_Vision_GRPO | PASS |
| Gemma3N_4B_Vision | PASS |
| Deepseek_OCR_3B | FAIL (pre-existing: easydict import blocked by unsloth_zoo patched_import) |
| Qwen2_5_7B_VL_GRPO | FAIL (pre-existing: vLLM multimodal tokenizer bug) |
5/7 pass. Both failures are pre-existing env issues unrelated to this PR.
Code review notes:

- As mmathew23 mentioned, the `_fix_requires_grad_hooks_for_kwargs` fix could be a 1-line change in `unsloth_zoo/peft_utils.py` rather than a 35-line workaround in `vision.py`. The upstream fix would benefit all models, not just Idefics3.
- The `VLLM_SUPPORTED_VLM` addition is correct, and the error message list should also include Idefics3 (currently only listed in the supported set but not in the user-facing error text).
- Extra blank lines at the end of the hook function (style nit).
Functionally works for all tested vision models.
Moved the requires_grad_pre_hook fix upstream to unsloth-zoo#514. This PR now only adds "idefics3" to VLLM_SUPPORTED_VLM and the error message text. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Thanks for the thorough testing @danielhanchen! All three points addressed:

- Moved the requires_grad_pre_hook fix upstream to unsloth-zoo#514
- Added "idefics3" to the user-facing error message text
- The style nit goes away with the hook function removed from vision.py

PR is now 2 lines changed (1 file).
@mmathew23 Could you help review this and my PR over at unsloth-zoo? Many thanks!
Closes #4079

Adds Idefics3 architecture support to `FastVisionModel`, enabling fine-tuning of `ibm-granite/granite-docling-258M` and other Idefics3-based models.

Changes

`unsloth/models/vision.py` — 1 file, 39 insertions:

- Add `"idefics3"` to `VLLM_SUPPORTED_VLM`
- Add `_fix_requires_grad_hooks_for_kwargs()` — replaces `requires_grad_pre_hook` on modules after registration to handle kwargs-only forward signatures (Idefics3's vision encoder passes all args via kwargs → empty positional args tuple → `RuntimeError: Failed to make input require gradients`)

Test results

`ibm-granite/granite-docling-258M` (257.5M params), RTX 5080 16GB:

SFT — `unsloth/LaTeX_OCR` dataset (68K samples), LoRA r=32 + DoRA: loss 3.1 → 1.3 over 30 steps, 3.7GB VRAM.

GRPO — TRL `GRPOTrainer`, text prompts, custom reward: 20 steps, full RL loop verified.

Pipeline: load → LoRA/DoRA → forward → backward → train → generate → save — all pass.

The hook fix is a no-op for models with non-empty positional args (all existing VLMs).
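For reference, the replace-after-registration idea can be sketched as below. The hook bodies are simplified stand-ins (not the actual unsloth_zoo implementations), and the swap relies on the private `_forward_pre_hooks` dict, mirroring what the workaround does:

```python
import torch
import torch.nn as nn

def requires_grad_pre_hook(module, input):
    # Simplified stand-in for the upstream hook: assumes positional args exist.
    if len(input) == 0:
        raise RuntimeError("Unsloth: Failed to make input require gradients!")
    if torch.is_floating_point(input[0]):
        input[0].requires_grad_(True)

def _safe_requires_grad_pre_hook(module, input):
    # Safe variant: kwargs-only forward means nothing to mark, so just return.
    if len(input) == 0:
        return
    if torch.is_floating_point(input[0]):
        input[0].requires_grad_(True)

layer = nn.Linear(4, 4)
layer.register_forward_pre_hook(requires_grad_pre_hook)

# Swap the hook in place, keeping its id and ordering.
for hook_id, fn in list(layer._forward_pre_hooks.items()):
    if fn.__name__ == "requires_grad_pre_hook":
        layer._forward_pre_hooks[hook_id] = _safe_requires_grad_pre_hook

x = torch.randn(2, 4)
layer(x)               # positional path: behavior unchanged
print(x.requires_grad)  # True

x2 = torch.randn(2, 4)
layer(input=x2)         # kwargs-only call: previously raised, now a no-op
print(x2.requires_grad)  # False
```

The second call demonstrates the "no-op" claim: positional callers still get their tensor marked, while kwargs-only callers no longer crash.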