Fast LLM speculative inference server for consumer hardware.
Updated Jul 1, 2026 - C++
Model compression toolkit engineered for enhanced usability, comprehensiveness, and efficiency.
DFlash & TurboQuant in llama.cpp, with up to 3x faster generation and 7.5x more KV cache in the same VRAM.
Fully uncensored, capability-enhanced abliteration of Qwen3.6-27B. NVFP4 + z-lab DFlash speculative decoding (n=12) on the unified ghcr.io/aeon-7/aeon-vllm-ultimate:latest container, tuned for long-context draft acceptance on DGX Spark. 6 HF variants (BF16/NVFP4/MTP/MTP-XS), docker-compose, and QuickStart.
SNDR Core Engine (Genesis) — vLLM runtime patch-overlay for Qwen3.6 + Gemma4 on consumer NVIDIA (Ampere sm_86, 2× A5000/3090). Qwen3.6-35B-A3B FP8 ~240 tok/s, 27B-int4 hybrid GDN+Mamba, Gemma4 26B/31B AWQ, 256K ctx. 321 patches: TurboQuant k8v4 KV, MTP/DFlash spec-decode, FULL cudagraph, hybrid GDN. vLLM pin dev424 + Control Center GUI.
vLLM Qwen 3.6-27B (AWQ-INT4) + DFlash speculative decoding on AMD Strix Halo (gfx1151 iGPU, 128 GB UMA, ROCm 7.13). 24.8 t/s single-stream, vision, tool calling, 256K context, OpenAI-compatible, Docker. Matches DGX Spark FP8+DFlash+MTP at a third of the cost. No CUDA.
llama.cpp fork optimized for NVIDIA DGX Spark / GB10 (Blackwell, SM 12.1) — TurboQuant weights + KV, NVFP4, DFlash MTP
Local AI workstation — discover, run, chat, benchmark, and generate images from open-weight models. DFlash/DDTree speculative decoding, TurboQuant & TriAttention cache compression strategies, MLX + llama.cpp + vLLM + MTPLX backends.
Run Qwen 3.6-27B AWQ-INT4 models with DFlash speculative decoding on AMD Strix Halo hardware using vLLM for high-throughput inference.
Qwen3.6-27B BF16+DFlash 13-config parameter sweep on repne/vllm:v2. Stage A (3x3 buffer/graph) + Stage B (num_speculative_tokens) + Quality (HumanEval/MBPP). 7h, 421 problems, 195 cells.
CLI for building and testing DFlash-style speculative decoding draft models.
GGUF-native DFlash speculative decoding runtime for local models
vLLM v0.21 + DFlash + thinking_token_budget for Gemma 4 & Qwen 3.6 on Blackwell GB10 (sm_121a / sm_120)
Reproducible efficient-inference stack for Qwen3.5-4B (AdaptFM Efficient Qwen Competition): GPTQ W4A16 g128, untied W8 lm_head, DFlash speculative decoding, and per-step vLLM latency optimizations. 7.745× average latency speedup with all quality gates passing.