Is it supported to quantize attention to fp8 with calibration? #1158

Open
YSF-A opened this issue Feb 16, 2025 · 2 comments
Comments

YSF-A commented Feb 16, 2025

Hi, I would like to know whether it is supported to quantize attention to FP8 with calibration.
Thanks.

dsikka (Collaborator) commented Feb 16, 2025

Hi @YSF-A - are you trying to quantize the outputs from the attention block or particular layers in the attention block?

dsikka self-assigned this Feb 16, 2025
dsikka added the question label (Further information is requested) Feb 16, 2025
YSF-A (Author) commented Feb 17, 2025

> Hi @YSF-A - are you trying to quantize the outputs from the attention block or particular layers in the attention block?

Thank you @dsikka.
Since FP8 attention is supported in FlashAttention-3, I would like to know if there is a way to obtain the Q/K/V scales, so that Q/K/V can be quantized to FP8 with those scales at inference time.
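For illustration only, here is a minimal calibration sketch in plain PyTorch (this is not llm-compressor's API): it registers forward hooks on the attention projections, records running absmax statistics for Q/K/V over calibration data, and converts them to per-tensor FP8 scales. The `q_proj`/`k_proj`/`v_proj` submodule names are assumptions based on common HF Llama-style attention modules; adjust them for your model.

```python
import torch
import torch.nn as nn

# Maximum representable value of the FP8 e4m3 format (448.0 in PyTorch).
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max


class QKVScaleObserver:
    """Tracks running absmax for Q, K, and V and derives per-tensor FP8 scales."""

    def __init__(self):
        self.absmax = {"q": 0.0, "k": 0.0, "v": 0.0}

    def update(self, name: str, tensor: torch.Tensor):
        self.absmax[name] = max(self.absmax[name], tensor.abs().max().item())

    def scales(self):
        # scale = absmax / FP8_MAX, so that (tensor / scale) fits the FP8 range.
        return {k: (v / FP8_MAX if v > 0 else 1.0) for k, v in self.absmax.items()}


def attach_qkv_hooks(attn_module: nn.Module, observer: QKVScaleObserver):
    """Hook the (assumed) q_proj/k_proj/v_proj outputs of one attention module."""
    handles = []
    for name, proj in (("q", attn_module.q_proj),
                       ("k", attn_module.k_proj),
                       ("v", attn_module.v_proj)):
        handles.append(proj.register_forward_hook(
            lambda mod, inp, out, n=name: observer.update(n, out)))
    return handles


# Usage sketch (model and calibration_loader are assumed to exist):
# observer = QKVScaleObserver()
# handles = attach_qkv_hooks(model.model.layers[0].self_attn, observer)
# with torch.no_grad():
#     for batch in calibration_loader:
#         model(**batch)
# for h in handles:
#     h.remove()
# print(observer.scales())  # e.g. pass as q/k/v scales to an FP8 attention kernel
```

The resulting scales could then be stored alongside the checkpoint and used at inference to quantize Q/K/V to FP8 before an FP8 attention kernel; whether and how llm-compressor exposes this is exactly the question asked above.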
