Is it supported to quantize attention to fp8 with calibration? #1158

Open
YSF-A opened this issue Feb 16, 2025 · 2 comments
Comments

YSF-A commented Feb 16, 2025

Hi, I would like to know whether it is supported to quantize attention to FP8 with calibration.
Thanks.

dsikka (Collaborator) commented Feb 16, 2025

Hi @YSF-A - are you trying to quantize the outputs from the attention block or particular layers in the attention block?

dsikka self-assigned this Feb 16, 2025
dsikka added the question label (Further information is requested) Feb 16, 2025
YSF-A (Author) commented Feb 17, 2025

> Hi @YSF-A - are you trying to quantize the outputs from the attention block or particular layers in the attention block?

Thank you @dsikka.
Since FP8 attention is supported in FlashAttention-3, I would like to know if there is a way to obtain the Q/K/V scales, so that Q/K/V can be quantized to FP8 with those scales at inference time.
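For illustration only, here is a minimal calibration sketch in plain PyTorch (this is not llm-compressor's API): it registers forward hooks on the attention projections, records running absmax statistics for Q/K/V over calibration data, and converts them to per-tensor FP8 scales. The `q_proj`/`k_proj`/`v_proj` submodule names are assumptions based on common HF Llama-style attention modules; adjust them for your model.

```python
import torch
import torch.nn as nn

# Maximum representable value of the FP8 e4m3 format (448.0 in PyTorch).
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max


class QKVScaleObserver:
    """Tracks running absmax for Q, K, and V and derives per-tensor FP8 scales."""

    def __init__(self):
        self.absmax = {"q": 0.0, "k": 0.0, "v": 0.0}

    def update(self, name: str, tensor: torch.Tensor):
        self.absmax[name] = max(self.absmax[name], tensor.abs().max().item())

    def scales(self):
        # scale = absmax / FP8_MAX, so that (tensor / scale) fits the FP8 range.
        return {k: (v / FP8_MAX if v > 0 else 1.0) for k, v in self.absmax.items()}


def attach_qkv_hooks(attn_module: nn.Module, observer: QKVScaleObserver):
    """Hook the (assumed) q_proj/k_proj/v_proj outputs of one attention module."""
    handles = []
    for name, proj in (("q", attn_module.q_proj),
                       ("k", attn_module.k_proj),
                       ("v", attn_module.v_proj)):
        handles.append(proj.register_forward_hook(
            lambda mod, inp, out, n=name: observer.update(n, out)))
    return handles


# Usage sketch (model and calibration_loader are assumed to exist):
# observer = QKVScaleObserver()
# handles = attach_qkv_hooks(model.model.layers[0].self_attn, observer)
# with torch.no_grad():
#     for batch in calibration_loader:
#         model(**batch)
# for h in handles:
#     h.remove()
# print(observer.scales())  # e.g. pass as q/k/v scales to an FP8 attention kernel
```

The resulting scales could then be stored alongside the checkpoint and used at inference to quantize Q/K/V to FP8 before an FP8 attention kernel; whether and how llm-compressor exposes this is exactly the question asked above.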
