📙Devstral 2 - How to Run Guide

Guide for running Mistral's Devstral 2 models locally: Devstral-2-123B-Instruct-2512 and Devstral-Small-2-24B-Instruct-2512.

Devstral 2 models are Mistral's new coding and agentic LLMs for software engineering, available in 24B and 123B sizes. The 123B model achieves SOTA results on SWE-bench and in coding, tool-calling, and agentic use cases. The 24B model fits in 25GB of RAM/VRAM, and the 123B fits in 128GB.

Devstral 2 has vision capabilities and a 256K context window, and uses the same architecture as Ministral 3. You can now run and fine-tune both models locally with Unsloth.

All Devstral 2 uploads use our Unsloth Dynamic 2.0 methodology, delivering the best performance on Aider Polyglot and 5-shot MMLU benchmarks.

Devstral 2 - Unsloth Dynamic GGUFs:

Devstral-Small-2-24B-Instruct-2512
Devstral-2-123B-Instruct-2512

🖥️ Running Devstral 2

See our step-by-step guides for running the Devstral 24B and the larger Devstral 123B models. Both models support vision, but vision is not currently supported in llama.cpp.

⚙️ Usage Guide

Here are the recommended settings for inference (an example command follows the list):

  • Temperature ~0.15

  • Min_P of 0.01 (optional, but 0.01 works well; llama.cpp's default is 0.1)

  • Use --jinja to enable the system prompt.

  • Max context length = 262,144

  • Recommended minimum context: 16,384

  • Install the latest llama.cpp since a December 13th 2025 pull request fixes issues.
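
Putting these settings together, a minimal llama-cli invocation might look like the sketch below. The Hugging Face repo name and quant suffix are assumptions; the model-specific sections below cover downloading in detail.

```bash
./llama.cpp/llama-cli \
    -hf unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF:UD-Q4_K_XL \
    --jinja \
    --temp 0.15 \
    --min-p 0.01 \
    --ctx-size 16384
```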

🎩Devstral-Small-2-24B

The full precision (Q8) Devstral-Small-2-24B GGUF will fit in 25GB RAM/VRAM. Text only for now.

✨ Run Devstral-Small-2-24B-Instruct-2512 in llama.cpp

  1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

  2. If you want llama.cpp to load the model directly, you can pull it straight from Hugging Face, where the suffix (:Q4_K_XL) selects the quantization type:

  3. Alternatively, download the model first (after installing the tooling via pip install huggingface_hub hf_transfer). You can choose UD-Q4_K_XL or other quantized versions.

  4. Run the model in conversation mode:
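
As a rough sketch, steps 1-4 might look like the below. It assumes the GGUF upload lives at unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF and that you pick the UD-Q4_K_XL quant; adjust names, paths, and flags to your setup.

```bash
# 1. Build llama.cpp (switch -DGGML_CUDA=ON to OFF for CPU-only inference)
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp/

# 2. Pull the quant straight from Hugging Face (:UD-Q4_K_XL picks the quantization)
./llama.cpp/llama-cli -hf unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF:UD-Q4_K_XL \
    --jinja --temp 0.15 --min-p 0.01 --ctx-size 16384

# 3. Or download the GGUF manually first
pip install huggingface_hub hf_transfer
huggingface-cli download unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF \
    --include "*UD-Q4_K_XL*" --local-dir Devstral-Small-2-24B-Instruct-2512-GGUF

# 4. Run the downloaded model in conversation mode
#    (adjust the .gguf filename to whatever was actually saved)
./llama.cpp/llama-cli \
    --model Devstral-Small-2-24B-Instruct-2512-GGUF/Devstral-Small-2-24B-Instruct-2512-UD-Q4_K_XL.gguf \
    --jinja --temp 0.15 --min-p 0.01 --ctx-size 16384 --n-gpu-layers 99
```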

👀Devstral and vision

  1. To play with Devstral's image capabilities, let's first download an image, such as the FP8 Reinforcement Learning with Unsloth graphic below:

  2. We get the image via wget https://unsloth.ai/cgi/image/fp8grpolarge_KharloZxEEaHAY2X97CEX.png?width=3840%26quality=80%26format=auto -O unsloth_fp8.png which will save the image as "unsloth_fp8.png".

  3. Then load the image in via /image unsloth_fp8.png after the model is loaded as seen below:

  4. We then prompt it Describe this image and get the below:
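
Putting those steps together, a sketch of the workflow is below; it assumes your llama.cpp build and the loaded GGUF actually expose Devstral's vision support.

```bash
# Download the example image
wget "https://unsloth.ai/cgi/image/fp8grpolarge_KharloZxEEaHAY2X97CEX.png?width=3840%26quality=80%26format=auto" \
    -O unsloth_fp8.png

# Then, in the interactive chat once the model is loaded, attach the image and prompt it:
#   /image unsloth_fp8.png
#   Describe this image
```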

🚚Devstral-2-123B

The full precision (Q8) Devstral-2-123B GGUF will fit in 128GB RAM/VRAM. Text only for now.

Run Devstral-2-123B-Instruct-2512 Tutorial

  1. Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

  2. You can directly pull from Hugging Face via:

  3. Alternatively, download the model first (after installing the tooling via pip install huggingface_hub hf_transfer). You can choose UD-Q4_K_XL or other quantized versions.

  4. Run the model in conversation mode:
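
The workflow mirrors the 24B sketch above, assuming the GGUF upload is unsloth/Devstral-2-123B-Instruct-2512-GGUF; the build step is identical, and the exact .gguf filenames are illustrative.

```bash
# Pull the quant straight from Hugging Face
./llama.cpp/llama-cli -hf unsloth/Devstral-2-123B-Instruct-2512-GGUF:UD-Q4_K_XL \
    --jinja --temp 0.15 --min-p 0.01 --ctx-size 16384

# Or download first, then run in conversation mode
pip install huggingface_hub hf_transfer
huggingface-cli download unsloth/Devstral-2-123B-Instruct-2512-GGUF \
    --include "*UD-Q4_K_XL*" --local-dir Devstral-2-123B-Instruct-2512-GGUF

# If the quant is split across multiple .gguf files, point --model at the first shard
# (filename below is illustrative; use the actual downloaded name)
./llama.cpp/llama-cli \
    --model Devstral-2-123B-Instruct-2512-GGUF/Devstral-2-123B-Instruct-2512-UD-Q4_K_XL-00001-of-00002.gguf \
    --jinja --temp 0.15 --min-p 0.01 --ctx-size 16384 --n-gpu-layers 99
```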

🦥 Fine-tuning Devstral 2 with Unsloth

Just like Ministral 3, Unsloth supports Devstral 2 fine-tuning. Training is 2x faster, uses 70% less VRAM, and supports 8x longer context lengths. Devstral 2 fits comfortably on a 24GB VRAM L4 GPU.

Unfortunately, Devstral 2 slightly exceeds the memory limits of a 16GB VRAM GPU, so fine-tuning it for free on Google Colab isn't possible for now. However, you can fine-tune the model for free using our Kaggle notebook, which offers access to dual GPUs. Just change the notebook's Magistral model name to the unsloth/Devstral-Small-2-24B-Instruct-2512 model.

  • Ministral-3B-Instruct Vision notebook (vision) (Change model name to Devstral 2)

  • Ministral-3B-Instruct GRPO notebook (Change model name to Devstral 2)

Devstral Vision finetuning notebook

Devstral Sudoku GRPO RL notebook
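
As a minimal sketch of the model-name swap those notebooks need, loading Devstral 2 in Unsloth might look like the below; the LoRA hyperparameters are illustrative, not the notebooks' exact values.

```python
from unsloth import FastLanguageModel

# Load Devstral 2 (24B) in 4-bit for QLoRA fine-tuning
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Devstral-Small-2-24B-Instruct-2512",
    max_seq_length=8192,   # raise if you have spare VRAM
    load_in_4bit=True,     # 4-bit QLoRA keeps the 24B model inside ~24GB VRAM
)

# Attach LoRA adapters before training
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```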

😎Llama-server serving & deployment

To deploy Devstral 2 for production, we use llama-server. In a new terminal (for example inside tmux), deploy the model via:
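
For example, a sketch using the 24B quant (swap in the 123B repo if you have the memory for it):

```bash
# Serve an OpenAI-compatible API (llama-server listens on port 8080 by default)
./llama.cpp/llama-server \
    -hf unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF:UD-Q4_K_XL \
    --jinja --temp 0.15 --min-p 0.01 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --host 0.0.0.0
```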

When you run the above, llama-server will start up and listen for OpenAI-compatible requests (on port 8080 by default).

Then in a new terminal, after doing pip install openai, do:
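
A minimal sketch; the "2+2" question is an assumption chosen to match the expected answer of 4.

```python
from openai import OpenAI

# llama-server exposes an OpenAI-compatible API; the API key can be any placeholder
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="sk-no-key-required")

response = client.chat.completions.create(
    model="devstral-2",  # llama-server serves whichever model it loaded; the name is just a label
    messages=[{"role": "user", "content": "What is 2+2? Answer with just the number."}],
    temperature=0.15,
)
print(response.choices[0].message.content)
```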

Which will simply print 4.

🧰Tool Calling with Devstral 2 Tutorial

After following Llama-server serving & deployment, we can then load up some tools and see Devstral in action! Let's make some tools - copy, paste, and execute them in Python.
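
The original tools aren't shown here, so the below is a sketch with two hypothetical stand-in tools and their OpenAI-style schemas.

```python
# Hypothetical stand-in tools; replace with whatever functions you want to expose
def get_current_weather(city: str) -> str:
    """Toy weather lookup; a real tool would call a weather API."""
    return f"The weather in {city} is sunny and 22C."

def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b

# JSON schemas the model sees, in the OpenAI tools format
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "multiply",
            "description": "Multiply two numbers",
            "parameters": {
                "type": "object",
                "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
                "required": ["a", "b"],
            },
        },
    },
]

# Lookup table used later when executing the model's tool calls
available_functions = {"get_current_weather": get_current_weather, "multiply": multiply}
```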

We then ask a simple question from a random list of possible messages to test the model:
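
For instance, a sketch that reuses the client from the serving section and the tools defined above; the candidate questions are illustrative.

```python
import random

# A few candidate test questions; one is picked at random
possible_messages = [
    "What's the weather like in Paris right now?",
    "What is 12.5 multiplied by 8?",
    "Check the weather in Tokyo and multiply 3 by 7.",
]

messages = [{"role": "user", "content": random.choice(possible_messages)}]
response = client.chat.completions.create(
    model="devstral-2",
    messages=messages,
    tools=tools,
    temperature=0.15,
)
```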

We then use the below functions (copy, paste, and execute), which will parse the function calls automatically - Devstral 2 might make multiple calls in tandem!
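
A sketch of such a parser, again assuming the hypothetical tools above; the real notebook's helper functions may differ.

```python
import json

def execute_tool_calls(response, messages):
    """Run every tool call in the response and append the results to the chat."""
    message = response.choices[0].message
    messages.append(message.model_dump(exclude_none=True))
    for tool_call in message.tool_calls or []:
        fn = available_functions[tool_call.function.name]
        args = json.loads(tool_call.function.arguments)
        result = fn(**args)  # Devstral 2 may emit several calls in tandem
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": str(result),
        })
    return messages

messages = execute_tool_calls(response, messages)

# Send the tool results back so the model can write its final answer
final = client.chat.completions.create(model="devstral-2", messages=messages, temperature=0.15)
print(final.choices[0].message.content)
```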

And after 1 minute, we get:

Or in JSON form:
