Add Idefics3 support (Granite Docling VLM) #4090

Open

gaztrabisme wants to merge 3 commits into unslothai:main from gaztrabisme:feat/idefics3-support

Conversation

@gaztrabisme

Closes #4079

Adds Idefics3 architecture support to FastVisionModel, enabling fine-tuning of ibm-granite/granite-docling-258M and other Idefics3-based models.

Changes

unsloth/models/vision.py — 1 file, 39 insertions:

  1. Add "idefics3" to VLLM_SUPPORTED_VLM
  2. Add _fix_requires_grad_hooks_for_kwargs() — replaces requires_grad_pre_hook on modules after registration so that kwargs-only forward signatures are handled (Idefics3's vision encoder passes all arguments via kwargs, so the hook sees an empty positional-args tuple and raises RuntimeError: Failed to make input require gradients)
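The replacement described in item 2 can be sketched as follows. This is a minimal standalone illustration of the technique, not the PR's exact code; the names _safe_requires_grad_pre_hook, fix_requires_grad_hooks_for_kwargs, and KwargsOnly are hypothetical:

```python
import torch

def _safe_requires_grad_pre_hook(module, inputs):
    # Tolerate kwargs-only forwards: `inputs` is () when every argument
    # arrives via keyword, so there is nothing to flag here.
    if len(inputs) == 0:
        return
    if isinstance(inputs[0], torch.Tensor) and torch.is_floating_point(inputs[0]):
        inputs[0].requires_grad_(True)

def fix_requires_grad_hooks_for_kwargs(model):
    # Swap any already-registered hook named `requires_grad_pre_hook`
    # for the safe variant, reusing the same hook slot.
    for module in model.modules():
        for hook_id, hook in list(module._forward_pre_hooks.items()):
            if getattr(hook, "__name__", "") == "requires_grad_pre_hook":
                module._forward_pre_hooks[hook_id] = _safe_requires_grad_pre_hook

# Demo: a module whose forward only accepts keyword arguments,
# mimicking Idefics3's vision encoder.
class KwargsOnly(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(4, 4)

    def forward(self, *, hidden_states):
        return self.proj(hidden_states)

def requires_grad_pre_hook(module, inputs):  # the strict upstream behaviour
    if len(inputs) == 0:
        raise RuntimeError("Unsloth: Failed to make input require gradients!")
    inputs[0].requires_grad_(True)

model = KwargsOnly()
model.register_forward_pre_hook(requires_grad_pre_hook)
fix_requires_grad_hooks_for_kwargs(model)
out = model(hidden_states=torch.randn(2, 4))  # no longer raises
```

For models that pass positional arguments (all existing VLMs), the safe hook behaves identically to the original, which is why the fix is a no-op for them.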

Test results

ibm-granite/granite-docling-258M (257.5M params), RTX 5080 16GB:

SFT — unsloth/LaTeX_OCR dataset (68K samples), LoRA r=32 + DoRA:

Loss:  3.11 → 1.29 (30 steps)
VRAM:  3.7 GB peak (24%)
Time:  107s

GRPO — TRL GRPOTrainer, text prompts, custom reward:

Loss:  -0.12 → -0.27 (3 steps, policy gradient)
Grad:  0.66 → 1.45 (non-zero, flowing)

Pipeline: load → LoRA/DoRA → forward → backward → train → generate → save — all pass.

The hook fix is a no-op for models with non-empty positional args (all existing VLMs).

Closes unslothai#4079

- Add "idefics3" to VLLM_SUPPORTED_VLM
- Add _fix_requires_grad_hooks_for_kwargs() to handle models whose
  forward() passes all arguments via kwargs (empty positional args
  tuple causes requires_grad_pre_hook to raise RuntimeError)

Tested with ibm-granite/granite-docling-258M (258M params):
- SFT on unsloth/LaTeX_OCR: loss 3.1 → 1.3, 30 steps, 3.7GB VRAM
- GRPO with TRL GRPOTrainer: full RL loop verified
- LoRA/DoRA + gradient checkpointing + generation all pass
@gemini-code-assist
Contributor

Summary of Changes

Hello @gaztrabisme, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for Idefics3 Vision-Language Models (VLMs) within the framework, specifically enabling fine-tuning of models like ibm-granite/granite-docling-258M. The core change involves extending the list of supported VLMs and implementing a crucial fix for gradient tracking in models that utilize kwargs-only forward signatures, preventing runtime errors during training.

Highlights

  • Idefics3 VLM Support: Added "idefics3" to the list of supported Vision-Language Models (VLLM_SUPPORTED_VLM), enabling compatibility with Idefics3-based models like ibm-granite/granite-docling-258M.
  • Gradient Hook Fix for Kwargs: Introduced _fix_requires_grad_hooks_for_kwargs() to address a RuntimeError when models (like Idefics3's vision encoder) pass all arguments via kwargs, ensuring proper gradient tracking by replacing existing requires_grad_pre_hook with a safe version.
  • Integration into PEFT Model Loading: Integrated the new gradient hook fix into the get_peft_model function, ensuring it is applied automatically when loading and preparing models for fine-tuning.


Changelog
  • unsloth/models/vision.py
    • Added "idefics3" to the VLLM_SUPPORTED_VLM list.
    • Implemented the _fix_requires_grad_hooks_for_kwargs function to handle gradient tracking for models with kwargs-only forward signatures.
    • Called _fix_requires_grad_hooks_for_kwargs within the get_peft_model function to apply the fix during model preparation.
Activity
  • The author has provided detailed test results for ibm-granite/granite-docling-258M covering SFT and GRPO training, demonstrating successful loss reduction, VRAM usage, and training times.
  • The pipeline for loading, LoRA/DoRA, forward/backward passes, training, generation, and saving has been verified to pass.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for the Idefics3 architecture (e.g., Granite Docling VLM) to FastVisionModel. It includes Idefics3 in the supported VLM list and introduces a fix for requires_grad pre-hooks to handle models that pass all arguments via keyword arguments, which results in empty positional argument tuples. The fix is implemented via _fix_requires_grad_hooks_for_kwargs, which replaces problematic hooks with a safer version. The implementation is generally sound, but there are opportunities to improve the robustness of the hook logic and remove redundant code.

Comment on lines 147 to 156
def _safe_requires_grad_pre_hook(module, input):
    type_input = type(input)
    if type_input is torch.Tensor:
        input.requires_grad_(True)
    elif type_input is tuple or type_input is list:
        if len(input) == 0:
            return
        if torch.is_floating_point(input[0]):
            input[0].requires_grad_(True)

Contributor


medium

The _safe_requires_grad_pre_hook implementation can be made more robust. Specifically, torch.is_floating_point(input[0]) will raise a TypeError if input[0] is not a torch.Tensor (e.g., if it is None or a raw integer). Additionally, the check for type_input is torch.Tensor is likely redundant as input in a module's forward_pre_hook is typically a tuple, but it's good to handle both cases safely. The pass at line 156 is also unnecessary.

    def _safe_requires_grad_pre_hook(module, input):
        if isinstance(input, (tuple, list)):
            if len(input) > 0 and isinstance(input[0], torch.Tensor) and torch.is_floating_point(input[0]):
                input[0].requires_grad_(True)
        elif isinstance(input, torch.Tensor) and torch.is_floating_point(input):
            input.requires_grad_(True)
Author


The type(x) is Y pattern and trailing pass match the existing codebase style (used throughout vision.py and unsloth_zoo). The upstream requires_grad_pre_hook also calls torch.is_floating_point(input[0]) without an isinstance guard — input in _forward_pre_hooks is always a tuple of tensors.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 96e4d87889


Comment on lines 152 to 153
if len(input) == 0:
    return


P1: Mark kwargs tensors for checkpointed forwards

On single-process training paths post_patch_model keeps use_reentrant=True, so checkpointed layers still require at least one grad-enabled input tensor to propagate parameter gradients. This hook now returns immediately when positional input is empty, but kwargs-only layers (the exact Idefics3 case) then never get any tensor switched to requires_grad=True; the crash is avoided, but LoRA parameters inside those checkpointed blocks can silently receive no gradients. Please handle kwargs tensors (or force non-reentrant checkpointing here) instead of early-returning on empty positional args.

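The behavior Codex is warning about can be reproduced in isolation: with use_reentrant=True, torch.utils.checkpoint only propagates parameter gradients through a checkpointed segment when at least one input tensor requires grad. A minimal standalone demo (not code from the PR):

```python
import torch
from torch.utils.checkpoint import checkpoint

lin = torch.nn.Linear(4, 4)

# Input that does NOT require grad: with use_reentrant=True the
# checkpointed segment keeps no autograd graph, so the Linear's
# parameters receive no gradients at all.
x = torch.randn(2, 4)
out = checkpoint(lin, x, use_reentrant=True)
try:
    out.sum().backward()
except RuntimeError:
    pass  # recent torch versions refuse to backprop here outright
assert lin.weight.grad is None

# Flagging the input (what requires_grad_pre_hook normally does)
# restores gradient flow through the checkpointed block.
x.requires_grad_(True)
out = checkpoint(lin, x, use_reentrant=True)
out.sum().backward()
assert lin.weight.grad is not None
```

This is why silently skipping the hook would be risky if the checkpointed block had no other grad-enabled input; the author's reply below explains why that path is not hit for Idefics3.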

Author


This does not apply here. For Idefics3, requires_grad_for_gradient_checkpointing finds model.text_model has get_input_embeddings and registers a post hook on embed_tokens — the requires_grad_pre_hook fallback is never used:

PRE hooks with requires_grad: 0
POST hooks: embed_tokens -> requires_grad_post_hook

Our replacement is defensive safety for models that might hit the fallback path with kwargs-only forward. The actual gradient flow is unaffected — confirmed by SFT loss 3.11 → 1.29 and GRPO grad_norm 0.66 → 1.45.

@Datta0
Collaborator

Datta0 commented Feb 23, 2026

I think @mmathew23 handled something like this, where args passed via kwargs caused trouble with gradients.
Perhaps for some other VLM (maybe Qwen3VL with Siglip?).

@gaztrabisme can you check whether you can find that and reuse it, so we have one consistent fix rather than a duplicate.

@gaztrabisme
Author

@Datta0 Thanks for the pointer. I looked through mmathew23's contributions — the closest match is 6ab47cf33962 in unsloth_zoo/compiler.py ("Fix gradient checkpointing layer caller kwargs"), which adds strip_kw_from_module_calls() to handle kwargs at the compiler/checkpointing-wrapper level.

Our issue is at a different layer — the requires_grad_pre_hook in unsloth_zoo/peft_utils.py (lines 212-225) raises RuntimeError when positional args are empty. Idefics3's vision encoder forwards everything via kwargs, so the hook sees an empty tuple and crashes. The original code actually has a commented-out graceful return for this exact case:

if len(input) == 0:
    raise RuntimeError("Unsloth: Failed to make input require gradients!")
    # print(f"  WARNING: Empty list input to {module.__class__.__name__}!")
    # return

Our fix replaces that hook post-registration to do the return instead of the raise. It's a no-op for all existing VLMs (they all have non-empty positional args).

Is there a specific commit or PR I might have missed? Happy to align if there's an existing pattern for this.

@Datta0
Collaborator

Datta0 commented Feb 23, 2026

Yeah I think you already found the right commit. I'd let @mmathew23 comment more on this :)

@mmathew23
Collaborator

@gaztrabisme do you have sample sft and grpo scripts to test this?

It might be best to make a PR to update https://github.com/unslothai/unsloth-zoo/blob/main/unsloth_zoo/peft_utils.py#L162-L325, rather than redefine most of the hook in vision.py.

@gaztrabisme
Author

Thanks @mmathew23! Good suggestion — fixing it at the source in peft_utils.py makes more sense than working around it in vision.py.

Test scripts (SFT + GRPO) are here: https://gist.github.com/gaztrabisme/d5c5f405146e679eae67b6ec1ee2e9eb

SFT — unsloth/LaTeX_OCR dataset, 30 steps, loss 3.11→1.29, 3.7 GB VRAM, 107s.

GRPO — 20 steps with GRPOTrainer, reward computation + KL tracking + policy updates all work, 2.4GB VRAM, ~13min.

Both use FastVisionModel.from_pretrained("ibm-granite/granite-docling-258M") with LoRA rank 32 + DoRA.

I can put together a PR to unsloth-zoo that handles the empty-args case in requires_grad_pre_hook — basically uncommenting the return path that's already sketched out there. Then this PR simplifies to just adding "idefics3" to VLLM_SUPPORTED_VLM.
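That upstream change would amount to roughly the following. This is a paraphrased sketch of the hook body under the assumption stated in the thread (empty positional args means a kwargs-only forward, with gradient flow handled by the post hook on the input embeddings), not the exact unsloth_zoo source:

```python
import torch

def requires_grad_pre_hook(module, input):
    # Proposed change: return gracefully instead of raising when the
    # positional-args tuple is empty (kwargs-only forward, e.g. the
    # Idefics3 vision encoder). Gradient flow for such models is
    # handled by the post hook on the input embeddings instead.
    if len(input) == 0:
        return
    if torch.is_floating_point(input[0]):
        input[0].requires_grad_(True)
```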

Contributor

@danielhanchen danielhanchen left a comment


Tested on 1x NVIDIA B200 (CUDA 12.8, torch 2.9.1, transformers 4.57.6, vLLM 0.15.1) with 7 vision notebooks:

Notebook                  Result
Qwen3_VL_8B_Vision        PASS
Qwen3_VL_8B_Vision_GRPO   PASS
Gemma3_4B_Vision          PASS
Gemma3_4B_Vision_GRPO     PASS
Gemma3N_4B_Vision         PASS
Deepseek_OCR_3B           FAIL (pre-existing: easydict import blocked by unsloth_zoo patched_import)
Qwen2_5_7B_VL_GRPO        FAIL (pre-existing: vLLM multimodal tokenizer bug)

5/7 pass. Both failures are pre-existing env issues unrelated to this PR.

Code review notes:

  1. As mmathew23 mentioned, the _fix_requires_grad_hooks_for_kwargs fix could be a 1-line change in unsloth_zoo/peft_utils.py rather than a 35-line workaround in vision.py. The upstream fix would benefit all models, not just Idefics3.
  2. The VLLM_SUPPORTED_VLM addition is correct and the error message list should also include Idefics3 (currently only listed in the supported set but not in the user-facing error text).
  3. Extra blank lines at the end of the hook function (style nit).

Functionally works for all tested vision models.

Moved the requires_grad_pre_hook fix upstream to unsloth-zoo#514.
This PR now only adds "idefics3" to VLLM_SUPPORTED_VLM and the
error message text.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gaztrabisme
Author

Thanks for the thorough testing @danielhanchen! All three points addressed:

  1. Hook workaround removed — the fix moved upstream to unsloth-zoo#514 (Fix requires_grad_pre_hook for kwargs-only forward methods). This PR is now just "idefics3" in the VLM list.
  2. Error message updated — Idefics3 added to the user-facing text.
  3. Extra blank lines cleaned up — gone with the workaround removal.

PR is now 2 lines changed (1 file).

@gaztrabisme
Author

@mmathew23 Could you help review this and my PR over at unsloth-zoo? Many thanks!
