Hi!
In sglang we cast each layer's entire KV cache from fp8 to bf16, which causes a performance degradation on H100:
https://github.com/sgl-project/sglang/blob/dbab5d50a3d6d1fc0169c04a33f1b6dcefdeec04/python/sglang/srt/layers/attention/flashinfer_mla_backend.py#L639
We do this because flashinfer doesn't support mixed dtypes for q and kv in MLA.
Can we support this? A trivial kv scale would be enough to start with.
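
For illustration, here is a minimal sketch of the current workaround versus the requested path. The shapes, the wrapper call in the comments, and the `k_scale` argument are assumptions for illustration only, not the actual sglang or flashinfer API.

```python
import torch

# Hypothetical MLA sizes, for illustration only.
num_tokens, num_heads, head_dim = 4096, 16, 576

q = torch.randn(num_tokens, num_heads, head_dim,
                dtype=torch.bfloat16, device="cuda")
kv_cache_fp8 = torch.randn(num_tokens, head_dim,
                           device="cuda").to(torch.float8_e4m3fn)

# Today: the whole per-layer KV cache is upcast to q.dtype before the MLA
# call, paying extra bandwidth and memory on every layer.
kv_cache_bf16 = kv_cache_fp8.to(q.dtype)
# ... attention_wrapper.run(q, kv_cache_bf16) ...   # hypothetical call

# Requested: accept the fp8 KV cache directly alongside bf16 q, plus a
# scalar k_scale so the kernel can dequantize internally. A trivial
# scale of 1.0 would already be enough to start.
k_scale = torch.tensor(1.0, device="cuda")
# ... attention_wrapper.run(q, kv_cache_fp8, k_scale=k_scale) ...   # hypothetical call
```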