Q bf16 kv fp8 for MLA #2144

@akhoroshev

Description

Hi!

In sglang, we cast the entire KV cache for each layer from fp8 to bf16, which causes performance degradation on H100:

https://github.com/sgl-project/sglang/blob/dbab5d50a3d6d1fc0169c04a33f1b6dcefdeec04/python/sglang/srt/layers/attention/flashinfer_mla_backend.py#L639

This is needed because flashinfer does not support mixed dtypes for q and kv in MLA.

Can this be supported? Trivial KV scales would be enough to start.
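To illustrate the cost being described: today the fp8 KV cache must be dequantized to a higher-precision buffer before every attention call, touching the whole cache each step. The sketch below simulates this with NumPy (a stand-in for the real fp8/bf16 tensors; the function names and the e4m3 range constant are illustrative, not sglang or flashinfer API):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # max finite magnitude of fp8 e4m3

def quantize_kv(kv: np.ndarray) -> tuple[np.ndarray, float]:
    """Per-tensor quantization with a single ("trivial") scale."""
    scale = float(np.abs(kv).max()) / FP8_E4M3_MAX or 1.0
    q = np.clip(kv / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q.astype(np.float32), scale  # stand-in for an fp8 buffer

def dequantize_kv(kv_q: np.ndarray, scale: float) -> np.ndarray:
    """The full-cache upcast done per layer, per step, before attention."""
    return (kv_q * scale).astype(np.float32)

kv = np.random.randn(4, 16).astype(np.float32)
kv_q, s = quantize_kv(kv)
kv_up = dequantize_kv(kv_q, s)
assert np.allclose(kv, kv_up, atol=1e-4)
```

With native mixed-dtype support in the kernel, the `dequantize_kv` pass would disappear: the attention kernel would read fp8 KV directly and apply the scale inside, with q staying in bf16.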
