IBM Granite 4.0
How to run IBM Granite-4.0 with Unsloth GGUFs on llama.cpp and Ollama, and how to fine-tune!
IBM releases Granite-4.0 models in four sizes: Nano (350M & 1B), Micro (3B), Tiny (7B total / 1B active) and Small (32B total / 9B active). Trained on 15T tokens, the Granite-4.0 models use IBM's new hybrid (H) Mamba architecture to run faster with lower memory use.
Learn how to run Unsloth Granite-4.0 Dynamic GGUFs, or how to fine-tune/RL the model. You can fine-tune Granite-4.0 for a support-agent use case with our free Colab notebook.
Unsloth Granite-4.0 uploads:
You can also view our Granite-4.0 collection for all uploads including Dynamic Float8 quants etc.
Granite-4.0 Model Explanations:
Nano and H-Nano: The 350M and 1B models offer strong instruction-following abilities, enabling on-device and edge AI as well as research and fine-tuning applications.
H-Small (MoE): Enterprise workhorse for daily tasks, supports multiple long-context sessions on entry GPUs like L40S (32B total, 9B active).
H-Tiny (MoE): Fast, cost-efficient for high-volume, low-complexity tasks; optimized for local and edge use (7B total, 1B active).
H-Micro (Dense): Lightweight, efficient for high-volume, low-complexity workloads; ideal for local and edge deployment (3B total).
Micro (Dense): Alternative dense option when Mamba2 isn’t fully supported (3B total).
Run Granite-4.0 Tutorials
⚙️ Recommended Inference Settings
IBM recommends these settings:
Temperature = 0.0
Top_K = 0
Top_P = 1.0
Recommended minimum context: 16,384
Maximum context window: 131,072 tokens (128K)
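For example, with llama.cpp these settings map directly onto `llama-cli`'s sampling flags. A sketch (the repo and quant tag are just one option; the full build and run walkthrough is in the llama.cpp section below):

```bash
# IBM's recommended sampling settings passed as llama.cpp flags
./llama.cpp/llama-cli -hf unsloth/granite-4.0-h-micro-GGUF:Q8_K_XL \
    --temp 0.0 --top-p 1.0 --top-k 0 --ctx-size 16384
```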
Chat template:
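The full Jinja chat template ships with the tokenizer, so the easiest way to inspect or apply it is via `transformers`. A sketch, assuming the `ibm-granite/granite-4.0-h-micro` checkpoint (any Granite-4.0 model works):

```python
from transformers import AutoTokenizer

# Checkpoint name is an assumption -- swap in the Granite-4.0 model you are using
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-4.0-h-micro")

messages = [{"role": "user", "content": "What is Granite-4.0?"}]
# Render the chat template into the exact prompt string the model expects
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
print(tokenizer.chat_template)  # the raw Jinja template itself
```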
🦙 Ollama: Run Granite-4.0 Tutorial
Install `ollama` if you haven't already!
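For example, on Linux or WSL (Ollama's standard install script; see ollama.com for the macOS and Windows installers):

```bash
curl -fsSL https://ollama.com/install.sh | sh
```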
Run the model! Note you can call `ollama serve` in another terminal if it fails. We include all our fixes and suggested parameters (temperature etc.) in `params` in our Hugging Face upload! You can change the model name 'granite-4.0-h-small-GGUF' to any Granite model like 'granite-4.0-h-micro:Q8_K_XL'.
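For example (a sketch; the `unsloth/granite-4.0-h-small-GGUF` repo and `Q4_K_XL` tag are one of several options):

```bash
# Pull and run the Unsloth GGUF straight from Hugging Face
ollama run hf.co/unsloth/granite-4.0-h-small-GGUF:Q4_K_XL
```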
📖 llama.cpp: Run Granite-4.0 Tutorial
Obtain the latest `llama.cpp` on GitHub here. You can follow the build instructions below as well. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or just want CPU inference.
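A minimal build sketch for Ubuntu with CUDA (package names and build targets may differ on your system):

```bash
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp/
```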
If you want to use `llama.cpp` directly to load models, you can do the below: (`:Q4_K_XL`) is the quantization type. You can also download via Hugging Face (point 3). This is similar to `ollama run`.
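A sketch of loading the GGUF straight from Hugging Face with `llama-cli`'s `-hf` flag (repo name assumed to be `unsloth/granite-4.0-h-small-GGUF`):

```bash
./llama.cpp/llama-cli \
    -hf unsloth/granite-4.0-h-small-GGUF:Q4_K_XL \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --temp 0.0 --top-p 1.0 --top-k 0
```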
OR download the model via the snippet below (after installing `pip install huggingface_hub hf_transfer`). You can choose Q4_K_M, or other quantized versions (like BF16 full precision).
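A download sketch using `huggingface_hub` (the repo name and quant pattern are assumptions; adjust them to the quant you want):

```python
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # enable faster hf_transfer downloads

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id = "unsloth/granite-4.0-h-small-GGUF",   # assumed repo name
    local_dir = "unsloth/granite-4.0-h-small-GGUF",
    allow_patterns = ["*Q4_K_M*"],  # e.g. ["*BF16*"] for full precision
)
```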
Run Unsloth's Flappy Bird test
Edit `--threads 32` for the number of CPU threads, `--ctx-size 16384` for context length (Granite-4.0 supports 128K context length!), and `--n-gpu-layers 99` for how many layers to offload to the GPU. Lower it if your GPU runs out of memory, or remove it for CPU-only inference. For conversation mode:
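A conversation-mode sketch using a locally downloaded file (the GGUF path is a placeholder; recent `llama-cli` builds drop into interactive chat automatically when the model ships a chat template):

```bash
./llama.cpp/llama-cli \
    --model unsloth/granite-4.0-h-small-GGUF/granite-4.0-h-small-UD-Q4_K_XL.gguf \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --temp 0.0 --top-p 1.0 --top-k 0
```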
🐋 Docker: Run Granite-4.0 Tutorial
If you already have Docker Desktop, all you need to do is run the command below and you're done:
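A sketch, assuming Docker Desktop's Model Runner and its `hf.co` pull syntax (the repo and quant tag are one option among many):

```bash
docker model run hf.co/unsloth/granite-4.0-h-small-GGUF:Q4_K_XL
```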
🦥 Fine-tuning Granite-4.0 in Unsloth
Unsloth now supports all Granite-4.0 models, including Nano, Micro, Tiny and Small, for fine-tuning. Training is 2x faster, uses 50% less VRAM and supports 6x longer context lengths. Granite-4.0 Micro and Tiny fit comfortably in a 15GB VRAM T4 GPU.
Granite-4.0 free fine-tuning notebook
Granite-4.0-350M fine-tuning notebook
This notebook trains a model to become a Support Agent that understands customer interactions, complete with analysis and recommendations. This setup allows you to train a bot that provides real-time assistance to support agents.
We also show you how to train a model using data stored in a Google Sheet.

Unsloth config for Granite-4.0:
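A minimal loading and LoRA sketch, assuming the `unsloth/granite-4.0-h-micro` checkpoint name and typical QLoRA settings; the notebooks above carry the full, tested configuration:

```python
from unsloth import FastLanguageModel

# Checkpoint name is an assumption -- any Granite-4.0 model on Hugging Face works
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/granite-4.0-h-micro",
    max_seq_length = 2048,   # raise for long-context fine-tuning
    load_in_4bit = True,     # 4-bit QLoRA keeps Micro/Tiny within ~15GB VRAM
)

# Attach LoRA adapters. The target modules below are the usual attention/MLP
# projections and may need adjusting for the hybrid Mamba layers.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    lora_dropout = 0,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)
```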
If you have an old version of Unsloth and/or are fine-tuning locally, install the latest version of Unsloth:
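A typical upgrade command (assuming a pip-based install):

```bash
pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo
```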