
feat(diffusers): support large models and add Shutdown for dynamic reloading #8404

Open

JairoGuo wants to merge 2 commits into mudler:master from JairoGuo:fix/large-model-device-map-support

Conversation


JairoGuo commented on Feb 5, 2026

Summary

This PR adds two features to the diffusers backend:

  1. Multi-GPU support for large models - Enables loading models >80GB across multiple GPUs
  2. Shutdown method - Properly releases GPU memory for dynamic model reloading

Problem

  1. Very large models (e.g., Qwen-Image ~95GB) cause OOM when loading on a single GPU
  2. No way to release GPU memory without restarting the service, preventing dynamic LoRA switching

Solution

1. Multi-GPU Distribution (device_map)

When LowVRAM is enabled (a sketch follows after this list):

  • Add low_cpu_mem_usage=True and device_map="balanced" during loading
  • Skip enable_model_cpu_offload() (conflicts with device_map)
  • Skip .to(device) (conflicts with device_map)
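
As a rough illustration of that branching, the loading logic can look like the following (a minimal sketch, not the backend's verbatim code — the helper name, the low_vram flag, and the generic DiffusionPipeline class are assumptions):

import torch
from diffusers import DiffusionPipeline

def load_pipeline(model_id: str, low_vram: bool, device: str):
    kwargs = {"torch_dtype": torch.bfloat16}
    if low_vram and torch.cuda.device_count() > 1:
        # Shard the weights across all visible GPUs instead of one device.
        kwargs["low_cpu_mem_usage"] = True
        kwargs["device_map"] = "balanced"

    pipe = DiffusionPipeline.from_pretrained(model_id, **kwargs)

    # Both calls below conflict with accelerate's device mapping (diffusers
    # raises a ValueError), so they are skipped whenever device_map was used.
    if getattr(pipe, "hf_device_map", None) is None:
        if low_vram:
            pipe.enable_model_cpu_offload()
        else:
            pipe.to(device)
    return pipe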

2. Shutdown Method

Add a Shutdown() method (sketched after this list) that:

  • Releases pipeline, controlnet, and compel objects
  • Clears CUDA cache with torch.cuda.empty_cache()
  • Resets state flags
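
A minimal sketch of such a handler, mirroring the Health signature quoted later in this thread (the attribute names self.pipe, self.controlnet, and self.compel are illustrative, not necessarily the backend's exact fields):

import gc
import torch
import backend_pb2  # generated gRPC stubs, as referenced elsewhere in the backend

def Shutdown(self, request, context):
    # Drop strong references so the underlying tensors become collectable.
    self.pipe = None
    self.controlnet = None
    self.compel = None
    # Reset state flags so the next load starts from a clean slate.
    self.img2vid = False
    self.txt2vid = False
    self.ltx2_pipeline = False
    gc.collect()
    if torch.cuda.is_available():
        # Hand cached allocator blocks back to the driver so other
        # processes (or a reloaded model) can actually use the memory.
        torch.cuda.empty_cache()
    return backend_pb2.Reply(message=bytes("OK", 'utf-8'))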

This enables dynamic LoRA switching:

# 1. Unload the model (assumes LocalAI's default address, localhost:8080)
curl -X POST http://localhost:8080/backend/shutdown \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-image"}'

# 2. Update the model config (change lora_adapters)

# 3. The next generation request triggers a reload with the new config
curl -X POST http://localhost:8080/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-image", "prompt": "..."}'

Testing

- Model: Qwen-Image (~95GB)
- Hardware: NVIDIA H20 (96GB) x3
- Tested: Multi-LoRA loading, dynamic LoRA switching, GPU memory release
fix(diffusers): support large models with device_map for multi-GPU distribution

When loading very large models (e.g., Qwen-Image ~95GB) on GPUs with limited
headroom, the model loads successfully but leaves no memory for inference.

This PR adds support for multi-GPU distribution via device_map when LowVRAM
is enabled:

1. Add low_cpu_mem_usage=True and device_map='balanced' during model loading
   to distribute large models across multiple GPUs

2. Skip enable_model_cpu_offload() when device_map is used, as they conflict
   with each other (ValueError: device mapping strategy doesn't allow
   enable_model_cpu_offload)

3. Skip .to(device) when device_map is used, as they also conflict
   (ValueError: device mapping strategy doesn't allow explicit device
   placement using to())

This enables running models like Qwen-Image on multi-GPU setups where a
single GPU doesn't have enough memory for both model weights and inference.

Tested with:
- Qwen-Image (~95GB) on 3x NVIDIA H20 (96GB each)
- Configuration: low_vram: true, pipeline_type: QwenImagePipeline (a full config sketch follows)
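
For reference, a model config of roughly this shape matches the tested setup (low_vram and pipeline_type are quoted above; the remaining YAML field names are assumptions about LocalAI's config layout):

name: qwen-image
backend: diffusers
parameters:
  model: Qwen/Qwen-Image
low_vram: true
diffusers:
  pipeline_type: QwenImagePipeline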

netlify Bot commented Feb 5, 2026

Deploy Preview for localai ready!

Name Link
🔨 Latest commit e3a64e0
🔍 Latest deploy log https://app.netlify.com/projects/localai/deploys/698606690ad0a00008dba93f
😎 Deploy Preview https://deploy-preview-8404--localai.netlify.app

JairoGuo changed the title from fix(diffusers): support large models with device_map for multi-GPU distribution to feat(diffusers): support large models and add Shutdown for dynamic reloading on Feb 5, 2026
Add Shutdown method to the diffusers backend that properly releases GPU
memory when a model is unloaded. This enables dynamic model reloading
with different configurations (e.g., switching LoRA adapters) without
restarting the service.

The Shutdown method:
- Releases the pipeline, controlnet, and compel objects
- Clears CUDA cache with torch.cuda.empty_cache()
- Resets state flags (img2vid, txt2vid, ltx2_pipeline)

This works with LocalAI's existing /backend/shutdown API endpoint,
which terminates the gRPC process. The explicit cleanup ensures
GPU memory is properly released before process termination.

Tested with Qwen-Image (~95GB) on NVIDIA H20 GPUs.
def Health(self, request, context):
    return backend_pb2.Reply(message=bytes("OK", 'utf-8'))

def Shutdown(self, request, context):
mudler (Owner) commented:


This is unused?

JairoGuo force-pushed the fix/large-model-device-map-support branch from 7b7dbd6 to e3a64e0 on February 6, 2026

Labels

None yet

2 participants