Q bf16 kv fp8 for MLA #2144

@akhoroshev

Description

Hi!

In sglang, we cast the entire KV cache for each layer from fp8 to bf16, which causes performance degradation on H100:

https://github.com/sgl-project/sglang/blob/dbab5d50a3d6d1fc0169c04a33f1b6dcefdeec04/python/sglang/srt/layers/attention/flashinfer_mla_backend.py#L639

This is needed because flashinfer does not support mixed dtypes for q and kv in MLA.

Can this be supported? Trivial KV scales would be enough to start.
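To illustrate the cost being described: today the fp8 KV cache must be dequantized to a higher-precision buffer before every attention call, touching the whole cache each step. The sketch below simulates this with NumPy (a stand-in for the real fp8/bf16 tensors; the function names and the e4m3 range constant are illustrative, not sglang or flashinfer API):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # max finite magnitude of fp8 e4m3

def quantize_kv(kv: np.ndarray) -> tuple[np.ndarray, float]:
    """Per-tensor quantization with a single ("trivial") scale."""
    scale = float(np.abs(kv).max()) / FP8_E4M3_MAX or 1.0
    q = np.clip(kv / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q.astype(np.float32), scale  # stand-in for an fp8 buffer

def dequantize_kv(kv_q: np.ndarray, scale: float) -> np.ndarray:
    """The full-cache upcast done per layer, per step, before attention."""
    return (kv_q * scale).astype(np.float32)

kv = np.random.randn(4, 16).astype(np.float32)
kv_q, s = quantize_kv(kv)
kv_up = dequantize_kv(kv_q, s)
assert np.allclose(kv, kv_up, atol=1e-4)
```

With native mixed-dtype support in the kernel, the `dequantize_kv` pass would disappear: the attention kernel would read fp8 KV directly and apply the scale inside, with q staying in bf16.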
