dflash

Here are 22 public repositories matching this topic...

Luce-Org / lucebox-hub

Fast LLM speculative inference server for consumer hardware.

spark kernel cuda cuda-kernels luce poolside rtx3090 llama-cpp local-ai qwen speculative-decoding dflash megakernel speculative-prefill pflash lucebox

Updated Jul 1, 2026
C++

Tencent / AngelSlim

Model compression toolkit engineered for enhanced usability, comprehensiveness, and efficiency.

audio eagle quantization diffusion vlm llm qwen speculative-decoding llm-compression hunyuan deepseek fp4 dflash

Updated Jul 1, 2026
Python

Anbeeld / beellama.cpp

DFlash & TurboQuant in llama.cpp, with up to 3x faster generation and 7.5x more KV cache in the same VRAM.

inference quantization kv-cache llm llm-serving llama-cpp ggml llm-inference speculative-decoding dflash turboquant

Updated Jun 17, 2026
C++

AEON-7 / Qwen3.6-27B-AEON-Ultimate-Uncensored-DFlash

Fully uncensored, capability-enhanced abliteration of Qwen3.6-27B. NVFP4 + z-lab DFlash speculative decoding (n=12) on the unified ghcr.io/aeon-7/aeon-vllm-ultimate:latest container, tuned for long-context draft acceptance on DGX Spark. 6 HF variants (BF16/NVFP4/MTP/MTP-XS), docker-compose, and QuickStart.

quantization uncensored blackwell llm vllm qwen speculative-decoding abliteration qwen3 nvfp4 dgx-spark dflash

Updated Jun 28, 2026
Python

Sandermage / sndr_core_engine

SNDR Core Engine (Genesis) — vLLM runtime patch-overlay for Qwen3.6 + Gemma4 on consumer NVIDIA (Ampere sm_86, 2× A5000/3090). Qwen3.6-35B-A3B FP8 ~240 tok/s, 27B-int4 hybrid GDN+Mamba, Gemma4 26B/31B AWQ, 256K ctx. 321 patches: TurboQuant k8v4 KV, MTP/DFlash spec-decode, FULL cudagraph, hybrid GDN. vLLM pin dev424 + Control Center GUI.

Updated Jul 1, 2026
Python

AEON-7 / vllm-dflash

DFlash vLLM for DGX Spark — Plug & Play Block-Diffusion Speculative Decoding

docker inference nvidia blackwell llm vllm qwen speculative-decoding block-diffusion nvfp4 dgx-spark dflash

Updated Jun 28, 2026
Python

hec-ovi / vllm-awq4-qwen

vLLM Qwen 3.6-27B (AWQ-INT4) + DFlash speculative decoding on AMD Strix Halo (gfx1151 iGPU, 128 GB UMA, ROCm 7.13). 24.8 t/s single-stream, vision, tool calling, 256K context, OpenAI-compatible, Docker. Matches DGX Spark FP8+DFlash+MTP at a third of the cost. No CUDA.

docker rocm openai-api awq vllm llm-inference speculative-decoding multimodal-llm qwen3 gfx1151 ryzen-ai-max dflash amd-strix-halo rdna35 27b

Updated May 10, 2026
Python

croll83 / llama.cpp-dgx

llama.cpp fork optimized for NVIDIA DGX Spark / GB10 (Blackwell, SM 12.1) — TurboQuant weights + KV, NVFP4, DFlash MTP

blackwell llama-cpp speculative-decoding gb10 nvfp4 dflash turboquant

Updated May 26, 2026
C++

cryptopoly / ChaosEngineAI

Local AI workstation — discover, run, chat, benchmark, and generate images from open-weight models. DFlash/DDTree speculative decoding, TurboQuant & TriAttention cache compression strategies, MLX + llama.cpp + vLLM + MTPLX backends.

Updated Jul 1, 2026
Python

phuongncn / qwen3.6-27b-speedhack-gx10-dgx-spark

Qwen3.6 27B × DFlash: 30-35 tok/s on NVIDIA DGX Spark (GB10) with llama.cpp

dgx llamacpp dflash qwen36

Updated May 2, 2026
C++

AEON-7 / Qwen3.6-27B-AEON-Ultimate-Uncensored-DDTree

Experimental DDTree-on-vLLM research track for Qwen3.6 AEON Ultimate on DGX Spark / GB10.

blackwell vllm speculative-decoding gb10 nvfp4 dgx-spark dflash qwen36 ddtree

Updated Jun 28, 2026
Python

aphroditeformal93 / vllm-awq4-qwen

Run Qwen 3.6-27B AWQ-INT4 models with DFlash speculative decoding on AMD Strix Halo hardware using vLLM for high-throughput inference.

docker rocm openai-api awq vllm llm-inference speculative-decoding multimodal-llm qwen3 gfx1151 ryzen-ai-max dflash amd-strix-halo rdna35 27b

Updated Jul 1, 2026
Python

jcartu / qwen-bench-2026-05-dflash-v2-sweep

Qwen3.6-27B BF16+DFlash 13-config parameter sweep on repne/vllm:v2. Stage A (3x3 buffer/graph) + Stage B (num_speculative_tokens) + Quality (HumanEval/MBPP). 7h, 421 problems, 195 cells.

benchmark inference parameter-sweep blackwell bf16 vllm speculative-decoding qwen3 dflash qwen-bench

Updated May 11, 2026
Shell

DAWNCR0W / dflasher

CLI for building and testing DFlash-style speculative decoding draft models.

cuda transformers mlx huggingface apple-silicon vllm llm-inference speculative-decoding draft-model dflash

Updated Jun 2, 2026
Python

am423 / dflash-robot

GGUF-native DFlash speculative decoding runtime for local models

cuda llama-cpp ggml llm-inference gguf speculative-decoding dflash

Updated May 5, 2026
C++

davidzha712 / vllm-dflash-budget-gb10

vLLM v0.21 + DFlash + thinking_token_budget for Gemma 4 & Qwen 3.6 on Blackwell GB10 (sm_121a / sm_120)

gemma reasoning blackwell vllm speculative-decoding nvfp4 dgx-spark cuda-graphs dflash gemma-4 thinking-budget

Updated May 26, 2026
Python

jcartu / qwen-bench-2026-05-11-v2-followup

Study #4: FP8+MTP{3,5} speed on repne/vllm:v2 + max_tokens=8192 quality re-runs for BF16+DFlash n=8 and FP8+MTP=3. Follow-up to studies #2 and #3.

benchmark inference mtp blackwell humaneval vllm speculative-decoding qwen3 mbpp dflash qwen-bench

Updated May 11, 2026
Python

sai-samarth / qwen35-4b-fast-inference

Reproducible efficient-inference stack for Qwen3.5-4B (AdaptFM Efficient Qwen Competition): GPTQ W4A16 g128, untied W8 lm_head, DFlash speculative decoding, and per-step vLLM latency optimizations. 7.745× average latency speedup with all quality gates passing.

quantization efficient-inference vllm gptq llm-inference qwen speculative-decoding dflash

Updated Jun 29, 2026
Python

jcartu / qwen36-27b-bf16-dflash-repne-vs-upstream

Same BF16+DFlash config on Repne fork vs upstream vLLM v0.20.1, dual RTX PRO 6000 Blackwell. Upstream's dflash collapses at long context (5-6x slower at 131K).

benchmark blackwell bf16 vllm qwen speculative-decoding dflash

Updated May 7, 2026

jcartu / repne-dflash-newimage

benchmark blackwell vllm qwen dflash

Updated May 7, 2026
