📙 Devstral 2 - How to Run Guide
A guide to running Mistral's Devstral 2 models locally: 123B-Instruct-2512 and Small-2-24B-Instruct-2512.
Devstral 2 is Mistral's new family of coding and agentic LLMs for software engineering, available in 24B and 123B sizes. The 123B model achieves SOTA results on SWE-bench and in coding, tool-calling, and agentic use-cases. The 24B model fits in 25GB of RAM/VRAM, and the 123B fits in 128GB.
13th December 2025 Update
We've resolved issues in Devstral's chat template, and results should be significantly better. The 24B & 123B uploads have been updated. Also install the latest llama.cpp as of 13th Dec 2025!
Devstral 2 supports vision capabilities, a 256k context window and uses the same architecture as Ministral 3. You can now run and fine-tune both models locally with Unsloth.
All Devstral 2 uploads use our Unsloth Dynamic 2.0 methodology, delivering the best performance on Aider Polyglot and 5-shot MMLU benchmarks.
Devstral 2 - Unsloth Dynamic GGUFs are available for both models on Hugging Face.
🖥️ Running Devstral 2
See our step-by-step guides below for running the Devstral 24B and the large Devstral 123B models. Both models support vision, but vision is not currently supported in llama.cpp.
⚙️ Usage Guide
Here are the recommended settings for inference:
- Temperature of ~0.15
- Min_P of 0.01 (optional, but 0.01 works well; the llama.cpp default is 0.1)
- Use `--jinja` to enable the system prompt.
- Max context length = 262,144
- Recommended minimum context: 16,384
- Install the latest llama.cpp, since a December 13th 2025 pull request fixes chat template issues.
🎩 Devstral-Small-2-24B
The full precision (Q8) Devstral-Small-2-24B GGUF will fit in 25GB RAM/VRAM. Text only for now.
✨ Run Devstral-Small-2-24B-Instruct-2512 in llama.cpp
Obtain the latest `llama.cpp` on GitHub here. You can follow the build instructions below as well. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or just want CPU inference.
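A typical build sketch, assuming a Debian/Ubuntu machine with CUDA; adjust the flags to match your setup:

```bash
# Build dependencies
apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y

# Clone and build llama.cpp; set -DGGML_CUDA=OFF for CPU-only inference
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-server llama-mtmd-cli
cp llama.cpp/build/bin/llama-* llama.cpp/
```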
If you want to use `llama.cpp` directly to load models, you can pull straight from Hugging Face, where `:Q4_K_XL` is the quantization type:
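A sketch of pulling and running in one step; the repo id is an assumption following Unsloth's usual `-GGUF` naming:

```bash
./llama.cpp/llama-cli \
    -hf unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF:Q4_K_XL \
    --jinja \
    --temp 0.15 \
    --min-p 0.01 \
    --ctx-size 16384
```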
Download the model via the snippet below (after installing the dependencies with `pip install huggingface_hub hf_transfer`). You can choose `UD_Q4_K_XL` or other quantized versions.
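A minimal download sketch; the repo id and the file pattern are assumptions based on Unsloth's usual naming, so adjust them to the actual repo contents:

```python
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # enable faster transfers

from huggingface_hub import snapshot_download

# Download only the UD_Q4_K_XL files; swap the pattern for other quants.
snapshot_download(
    repo_id="unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF",  # assumed repo id
    local_dir="Devstral-Small-2-24B-Instruct-2512-GGUF",
    allow_patterns=["*UD_Q4_K_XL*"],
)
```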
Run the model in conversation mode:
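A run sketch, assuming the download location from the snippet above (the exact `.gguf` filename is an assumption, so adjust it to the downloaded file):

```bash
./llama.cpp/llama-cli \
    --model Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512-UD_Q4_K_XL.gguf \
    --jinja \
    --ctx-size 16384 \
    --temp 0.15 \
    --min-p 0.01 \
    --n-gpu-layers 99
```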
👀 Devstral and vision
To play with Devstral's image capabilities, let's first download an image, like the FP8 Reinforcement Learning with Unsloth one below:

We get the image via `wget https://unsloth.ai/cgi/image/fp8grpolarge_KharloZxEEaHAY2X97CEX.png?width=3840%26quality=80%26format=auto -O unsloth_fp8.png`, which will save the image as `unsloth_fp8.png`. Then load the image in via `/image unsloth_fp8.png` after the model is loaded, as seen below:
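A sketch of loading the model with multimodal support so the `/image` command is available; this assumes a llama.cpp build with mtmd support for this model and an mmproj file in the repo (both assumptions, given the note above about llama.cpp vision support):

```bash
# llama-mtmd-cli handles image input; type "/image unsloth_fp8.png"
# at the chat prompt once the model has loaded.
./llama.cpp/llama-mtmd-cli \
    -hf unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF:Q4_K_XL \
    --jinja \
    --temp 0.15 \
    --min-p 0.01 \
    --ctx-size 16384
```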
We then prompt it with `Describe this image` and get the below:
🚚 Devstral-2-123B
The full precision (Q8) Devstral-2-123B GGUF will fit in 128GB RAM/VRAM. Text only for now.
✨ Run Devstral-2-123B-Instruct-2512 Tutorial
Obtain the latest `llama.cpp` on GitHub here. You can follow the same build instructions as in the 24B section above. Change `-DGGML_CUDA=ON` to `-DGGML_CUDA=OFF` if you don't have a GPU or just want CPU inference.
You can directly pull from Hugging Face via:
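As before, a sketch with an assumed repo id:

```bash
./llama.cpp/llama-cli \
    -hf unsloth/Devstral-2-123B-Instruct-2512-GGUF:Q4_K_XL \
    --jinja \
    --temp 0.15 \
    --min-p 0.01 \
    --ctx-size 16384
```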
Download the model via the snippet below (after installing the dependencies with `pip install huggingface_hub hf_transfer`). You can choose `UD_Q4_K_XL` or other quantized versions.
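The same download sketch as for the 24B, with the repo id and file pattern again assumed from Unsloth's usual naming:

```python
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # enable faster transfers

from huggingface_hub import snapshot_download

# Download only the UD_Q4_K_XL files; swap the pattern for other quants.
snapshot_download(
    repo_id="unsloth/Devstral-2-123B-Instruct-2512-GGUF",  # assumed repo id
    local_dir="Devstral-2-123B-Instruct-2512-GGUF",
    allow_patterns=["*UD_Q4_K_XL*"],
)
```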
Run the model in conversation mode:
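A run sketch assuming the download above; the `.gguf` filename is an assumption, and for the 123B you may need to offload fewer layers to the GPU depending on your VRAM:

```bash
./llama.cpp/llama-cli \
    --model Devstral-2-123B-Instruct-2512-GGUF/Devstral-2-123B-Instruct-2512-UD_Q4_K_XL.gguf \
    --jinja \
    --ctx-size 16384 \
    --temp 0.15 \
    --min-p 0.01 \
    --n-gpu-layers 99
```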
🦥 Fine-tuning Devstral 2 with Unsloth
Just like Ministral 3, Unsloth supports Devstral 2 fine-tuning. Training is 2x faster, uses 70% less VRAM, and supports 8x longer context lengths. Devstral 2 fits comfortably on a 24GB L4 GPU.
Unfortunately, Devstral 2 slightly exceeds the memory limits of a 16GB VRAM GPU, so fine-tuning it for free on Google Colab isn't possible for now. However, you can fine-tune the model for free using our Kaggle notebook, which offers access to dual GPUs. Just change the notebook's Magistral model name to unsloth/Devstral-Small-2-24B-Instruct-2512.
We made free Unsloth notebooks to fine-tune Ministral 3, and they directly support Devstral 2, since the models share the same architecture! Change the model name in the notebook to the one you want; see the sketch after this list.
Ministral-3B-Instruct Vision notebook (change model name to Devstral 2)
Ministral-3B-Instruct GRPO notebook (change model name to Devstral 2)
Devstral Vision finetuning notebook
Devstral Sudoku GRPO RL notebook
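A minimal loading sketch for the notebooks, assuming typical Unsloth QLoRA defaults; the sequence length, LoRA rank, and target modules here are illustrative, not the notebooks' exact settings:

```python
from unsloth import FastLanguageModel

# Load Devstral 2 in 4-bit for QLoRA fine-tuning.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Devstral-Small-2-24B-Instruct-2512",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Attach LoRA adapters to the usual attention and MLP projections.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```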
😎 Llama-server serving & deployment
To deploy Devstral 2 for production, we use `llama-server`. In a new terminal (say via tmux), deploy the model via:
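A deployment sketch; the model path matches the download sketch above, and the port is an arbitrary choice:

```bash
./llama.cpp/llama-server \
    --model Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512-UD_Q4_K_XL.gguf \
    --host 0.0.0.0 \
    --port 8001 \
    --jinja \
    --ctx-size 16384 \
    --temp 0.15 \
    --min-p 0.01
```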
When you run the above, you will get:

Then in a new terminal, after doing `pip install openai`, do:
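A client sketch; the base URL assumes the port chosen above, and the prompt is a stand-in consistent with the printed result below:

```python
from openai import OpenAI

# Any non-empty api_key works; llama-server does not check it by default.
client = OpenAI(base_url="http://localhost:8001/v1", api_key="sk-no-key-required")

completion = client.chat.completions.create(
    model="devstral-2",  # llama-server serves one model, so the name is not strict
    messages=[{"role": "user", "content": "What is 2+2?"}],
    temperature=0.15,
)
print(completion.choices[0].message.content)
```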
Which will simply print 4.
🧰 Tool Calling with Devstral 2 Tutorial
After following the Llama-server serving & deployment section above, we can load up some tools and see Devstral in action! Let's make some tools - copy, paste, and execute them in Python.
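The guide's original tool definitions aren't reproduced here; below is a stand-in pair of tools with OpenAI-style schemas, which llama-server accepts on its chat completions endpoint:

```python
# Hypothetical tools for illustration.
def add_numbers(a: float, b: float) -> float:
    """Return the sum of two numbers."""
    return a + b

def multiply_numbers(a: float, b: float) -> float:
    """Return the product of two numbers."""
    return a * b

# OpenAI-style tool schemas; both tools take the same two numeric arguments.
tools = [
    {
        "type": "function",
        "function": {
            "name": name,
            "description": fn.__doc__,
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {"type": "number"},
                    "b": {"type": "number"},
                },
                "required": ["a", "b"],
            },
        },
    }
    for name, fn in [("add_numbers", add_numbers),
                     ("multiply_numbers", multiply_numbers)]
]
```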
We then ask a simple question from a random list of possible messages to test the model:
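A sketch of the request; the question list is a stand-in, and `client` and `tools` come from the snippets above:

```python
import random

# Stand-in questions; any math question that maps onto the tools works.
possible_messages = [
    "What is 3 multiplied by 7?",
    "What do you get if you add 41 and 1?",
]
messages = [{"role": "user", "content": random.choice(possible_messages)}]

response = client.chat.completions.create(
    model="devstral-2",
    messages=messages,
    tools=tools,
    temperature=0.15,
)
```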
We then use the below functions (copy, paste, and execute), which will parse the function calls automatically - Devstral 2 might make multiple calls in tandem!
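A parsing sketch under the same assumptions; it executes every tool call the model emitted:

```python
import json

# Map tool names back to the Python callables defined earlier.
available_functions = {
    "add_numbers": add_numbers,
    "multiply_numbers": multiply_numbers,
}

def execute_tool_calls(response):
    """Run every tool call in the response; the model may emit several."""
    results = []
    for tool_call in response.choices[0].message.tool_calls or []:
        fn = available_functions[tool_call.function.name]
        args = json.loads(tool_call.function.arguments)
        results.append({"name": tool_call.function.name, "result": fn(**args)})
    return results

print(execute_tool_calls(response))
```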
And after about 1 minute, we get the tool-call results back, which can also be viewed in JSON form.