IBM Granite 4.0
How to run IBM Granite-4.0 with Unsloth GGUFs on llama.cpp and Ollama, and how to fine-tune!
IBM releases Granite-4.0 models in four sizes: Nano (350M & 1B), Micro (3B), Tiny (7B total / 1B active) and Small (32B total / 9B active). Trained on 15T tokens, the Granite-4.0 models use IBM's new hybrid (H) Mamba architecture to run faster with lower memory use.
Learn how to run Unsloth Granite-4.0 Dynamic GGUFs, or how to fine-tune/RL the model. You can fine-tune Granite-4.0 for a support-agent use case with our free Colab notebook.
Unsloth Granite-4.0 uploads:
You can also view our Granite-4.0 collection for all uploads including Dynamic Float8 quants etc.
Granite-4.0 Model Explanations:
Nano and H-Nano: The 350M and 1B models offer strong instruction-following abilities, enabling on-device and edge AI as well as research and fine-tuning applications.
H-Small (MoE): Enterprise workhorse for daily tasks, supports multiple long-context sessions on entry GPUs like L40S (32B total, 9B active).
H-Tiny (MoE): Fast, cost-efficient for high-volume, low-complexity tasks; optimized for local and edge use (7B total, 1B active).
H-Micro (Dense): Lightweight, efficient for high-volume, low-complexity workloads; ideal for local and edge deployment (3B total).
Micro (Dense): Alternative dense option when Mamba2 isn’t fully supported (3B total).
Run Granite-4.0 Tutorials
⚙️ Recommended Inference Settings
IBM recommends these settings:
Temperature = 0.0
Top_K = 0
Top_P = 1.0
Recommended minimum context: 16,384
Maximum context window: 131,072 tokens (128K)
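For example, with llama.cpp these settings map directly onto `llama-cli`'s sampling flags. A sketch (the repo and quant tag are just one option; the full build and run walkthrough is in the llama.cpp section below):

```bash
# IBM's recommended sampling settings passed as llama.cpp flags
./llama.cpp/llama-cli -hf unsloth/granite-4.0-h-micro-GGUF:Q8_K_XL \
    --temp 0.0 --top-p 1.0 --top-k 0 --ctx-size 16384
```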
Chat template:
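The full Jinja chat template ships with the tokenizer, so the easiest way to inspect or apply it is via `transformers`. A sketch, assuming the `ibm-granite/granite-4.0-h-micro` checkpoint (any Granite-4.0 model works):

```python
from transformers import AutoTokenizer

# Checkpoint name is an assumption -- swap in the Granite-4.0 model you are using
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-4.0-h-micro")

messages = [{"role": "user", "content": "What is Granite-4.0?"}]
# Render the chat template into the exact prompt string the model expects
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
print(tokenizer.chat_template)  # the raw Jinja template itself
```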
🦙 Ollama: Run Granite-4.0 Tutorial
Install `ollama` if you haven't already!
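For example, on Linux or WSL (Ollama's standard install script; see ollama.com for the macOS and Windows installers):

```bash
curl -fsSL https://ollama.com/install.sh | sh
```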
Run the model! Note you can call `ollama serve` in another terminal if it fails. We include all our fixes and suggested parameters (temperature etc.) in `params` in our Hugging Face upload! You can change the model name 'granite-4.0-h-small-GGUF' to any Granite model like 'granite-4.0-h-micro:Q8_K_XL'.
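For example (a sketch; the `unsloth/granite-4.0-h-small-GGUF` repo and `Q4_K_XL` tag are one of several options):

```bash
# Pull and run the Unsloth GGUF straight from Hugging Face
ollama run hf.co/unsloth/granite-4.0-h-small-GGUF:Q4_K_XL
```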
📖 llama.cpp: Run Granite-4.0 Tutorial
Obtain the latest `llama.cpp` on GitHub here. You can follow the build instructions below as well. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or just want CPU inference.
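A minimal build sketch for Ubuntu with CUDA (package names and build targets may differ on your system):

```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp/
```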
If you want to use `llama.cpp` directly to load models, you can do the below: (`:Q4_K_XL`) is the quantization type. You can also download via Hugging Face (point 3). This is similar to `ollama run`.
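A sketch of loading the GGUF straight from Hugging Face with `llama-cli`'s `-hf` flag (repo name assumed to be `unsloth/granite-4.0-h-small-GGUF`):

```bash
./llama.cpp/llama-cli \
    -hf unsloth/granite-4.0-h-small-GGUF:Q4_K_XL \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --temp 0.0 --top-p 1.0 --top-k 0
```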
OR download the model via the snippet below (after installing `pip install huggingface_hub hf_transfer`). You can choose Q4_K_M, or other quantized versions (like BF16 full precision).
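A download sketch using `huggingface_hub` (the repo name and quant pattern are assumptions; adjust them to the quant you want):

```python
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # enable faster hf_transfer downloads

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id = "unsloth/granite-4.0-h-small-GGUF",   # assumed repo name
    local_dir = "unsloth/granite-4.0-h-small-GGUF",
    allow_patterns = ["*Q4_K_M*"],  # e.g. ["*BF16*"] for full precision
)
```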
Run Unsloth's Flappy Bird test
Edit `--threads 32` for the number of CPU threads, `--ctx-size 16384` for context length (Granite-4.0 supports 128K context length!), and `--n-gpu-layers 99` for how many layers to offload to the GPU. Lower it if your GPU runs out of memory, or remove it for CPU-only inference. For conversation mode:
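A conversation-mode sketch using a locally downloaded file (the GGUF path is a placeholder; recent `llama-cli` builds drop into interactive chat automatically when the model ships a chat template):

```bash
./llama.cpp/llama-cli \
    --model unsloth/granite-4.0-h-small-GGUF/granite-4.0-h-small-UD-Q4_K_XL.gguf \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --temp 0.0 --top-p 1.0 --top-k 0
```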
🐋 Docker: Run Granite-4.0 Tutorial
If you already have Docker Desktop, all you need to do is run the command below and you're done:
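A sketch, assuming Docker Desktop's Model Runner and its `hf.co` pull syntax (the repo and quant tag are one option among many):

```bash
docker model run hf.co/unsloth/granite-4.0-h-small-GGUF:Q4_K_XL
```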
🦥 Fine-tuning Granite-4.0 in Unsloth
Unsloth now supports all Granite-4.0 models, including Nano, Micro, Tiny and Small, for fine-tuning. Training is 2x faster, uses 50% less VRAM and supports 6x longer context lengths. Granite-4.0 Micro and Tiny fit comfortably in a 15GB VRAM T4 GPU.
Granite-4.0 free fine-tuning notebook
Granite-4.0-350M fine-tuning notebook
This notebook trains a model to become a Support Agent that understands customer interactions, complete with analysis and recommendations. This setup allows you to train a bot that provides real-time assistance to support agents.
We also show you how to train a model using data stored in a Google Sheet.

Unsloth config for Granite-4.0:
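A minimal loading and LoRA sketch, assuming the `unsloth/granite-4.0-h-micro` checkpoint name and typical QLoRA settings; the notebooks above carry the full, tested configuration:

```python
from unsloth import FastLanguageModel

# Checkpoint name is an assumption -- any Granite-4.0 model on Hugging Face works
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/granite-4.0-h-micro",
    max_seq_length = 2048,   # raise for long-context fine-tuning
    load_in_4bit = True,     # 4-bit QLoRA keeps Micro/Tiny within ~15GB VRAM
)

# Attach LoRA adapters. The target modules below are the usual attention/MLP
# projections and may need adjusting for the hybrid Mamba layers.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    lora_dropout = 0,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)
```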
If you have an old version of Unsloth and/or are fine-tuning locally, install the latest version of Unsloth:
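A typical upgrade command (assuming a pip-based install):

```bash
pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo
```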