[BugFix]add int8 cache dtype when using attention quantization #128

Angazenn · 2025-02-21T02:11:20Z

Ascend attention quantization requires an int8 KV cache dtype. The cache dtype is used during initialization of CacheConfig:

if cache_config.cache_dtype == "auto":
    self.dtype = model_config.dtype
else:
    self.dtype = STR_DTYPE_TO_TORCH_DTYPE[cache_config.cache_dtype]

STR_DTYPE_TO_TORCH_DTYPE is defined in vllm.utils:

STR_DTYPE_TO_TORCH_DTYPE = {
    "half": torch.half,
    "bfloat16": torch.bfloat16,
    "float": torch.float,
    "fp8": torch.uint8,
    "fp8_e4m3": torch.uint8,
    "fp8_e5m2": torch.uint8,
}

The mapping has no "int8" entry, so we need to update both: set cache_dtype to "int8" and register "int8" in STR_DTYPE_TO_TORCH_DTYPE.
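
A minimal sketch of why both changes are needed, assuming the lookup behaves as in the snippet quoted above. The table here is a stand-in whose values are plain strings so the snippet runs without torch installed; `resolve_cache_dtype` and `lookup_fails` are hypothetical helpers for illustration, not vLLM functions.

```python
# Stand-in for vllm.utils.STR_DTYPE_TO_TORCH_DTYPE. Values are strings
# (an assumption for this sketch) so that torch is not required to run it.
STR_DTYPE_TO_TORCH_DTYPE = {
    "half": "torch.half",
    "bfloat16": "torch.bfloat16",
    "float": "torch.float",
    "fp8": "torch.uint8",
    "fp8_e4m3": "torch.uint8",
    "fp8_e5m2": "torch.uint8",
}


def resolve_cache_dtype(cache_dtype, model_dtype):
    """Hypothetical helper mirroring the quoted lookup logic."""
    if cache_dtype == "auto":
        return model_dtype
    return STR_DTYPE_TO_TORCH_DTYPE[cache_dtype]


def lookup_fails(cache_dtype):
    """Return True if resolving this cache dtype raises KeyError."""
    try:
        resolve_cache_dtype(cache_dtype, "torch.bfloat16")
    except KeyError:
        return True
    return False


# Setting cache_dtype to "int8" alone is not enough: the key is unknown.
failed_before = lookup_fails("int8")

# Registering the key makes the same lookup succeed.
STR_DTYPE_TO_TORCH_DTYPE["int8"] = "torch.int8"
failed_after = lookup_fails("int8")
resolved = resolve_cache_dtype("int8", "torch.bfloat16")
```

With "auto" the model dtype is used and the table is never consulted, which is why the missing key only shows up once cache_dtype is set explicitly.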

whx-sjtu · 2025-02-21T03:37:00Z

vllm_ascend/worker.py

+            # dtype string to torch.dtype. Hence we have to move these codes to here.
+            from vllm.utils import STR_DTYPE_TO_TORCH_DTYPE
+            self.cache_config.cache_dtype = 'int8'
+            STR_DTYPE_TO_TORCH_DTYPE['int8'] = torch.int8


Does vLLM already have an entry point for registering custom quant types in STR_DTYPE_TO_TORCH_DTYPE? If not, maybe we should open an issue with vLLM? @wangxiyuan please check this.

No, we can create a PR to vLLM later.

Currently, for this kind of change (monkey patch), please move it to the patch module.
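
One way such a patch could be kept in a single place is an idempotent helper, sketched below. The function name and the choice to pass the table in as an argument are assumptions for illustration; the actual layout of the patch module is not shown in this thread.

```python
def apply_int8_cache_dtype_patch(dtype_table, int8_dtype):
    """Hypothetical patch helper: register "int8" in the given table once.

    Returns True if the table was modified, False if "int8" was already
    present (for example because upstream added it, or the patch already ran).
    """
    if "int8" in dtype_table:
        return False
    dtype_table["int8"] = int8_dtype
    return True


# Demonstration with a stand-in table; in real use the arguments would be
# vllm.utils.STR_DTYPE_TO_TORCH_DTYPE and torch.int8.
table = {"half": "torch.half"}
first_call = apply_int8_cache_dtype_patch(table, "torch.int8")
second_call = apply_int8_cache_dtype_patch(table, "torch.int8")
```

Checking for the key first means the helper never overwrites an entry that upstream might add later.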

MengqingCao · 2025-02-21T06:36:36Z

CI failed due to the test path update in vLLM; this will be fixed in #124.

Signed-off-by: angazenn <[email protected]>

wangxiyuan · 2025-03-03T01:22:39Z

I think this PR needs a rebase.

whx-sjtu reviewed Feb 21, 2025

Angazenn force-pushed the bug_fix branch 2 times, most recently from f209600 to 6075a6e Compare February 22, 2025 02:13

angazenn added 3 commits February 25, 2025 21:05

1d34892: adapt to new torch_npu interface
164f3f8: fix yapf
225e8a9: add int8 dtype

Each commit is Signed-off-by: angazenn <[email protected]>

Angazenn force-pushed the bug_fix branch from 6075a6e to 225e8a9 Compare March 1, 2025 06:42

github-actions bot added module:ops module:core labels Mar 1, 2025
