PaddlePaddle · qingqing01 · Sep 27, 2024 · Aug 22, 2024 · Aug 23, 2024 · Aug 27, 2024
diff --git a/llm/docs/predict/best_practices.md b/llm/docs/predict/best_practices.md
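@@ -4,18 +4,17 @@ PaddleNLP provides a number of environment variables for tuning inference performance and resource usage
 
 **GEMM optimization**
 
-- `FLAGS_enable_blaslt_global_search`: whether to enable global tuning for int8 GEMM. The default is 0 (disabled). When set to 1, PaddleNLP uses the best GEMM configuration recorded in `FLAGS_cublaslt_device_best_config` during inference.
+- `FLAGS_enable_blaslt_global_search`: whether to enable global tuning for int8 GEMM. The default is 0 (disabled). When set to 1, PaddleNLP dynamically searches for the best GEMM algorithm during inference. Enabling this flag improves performance when running A8W8 models.
 
-- `FLAGS_cublaslt_device_best_config`: path to the best-performing int8 GEMM configuration file; the default is "". The file can be produced with `PaddleNLP/csrc/generation/test_tune_cublaslt_gemm.py`, which automatically searches for the best GEMM configuration that cuBLASLt offers for the current input sizes and records the result.
+
+- `FLAGS_cublaslt_device_best_config`: when `FLAGS_enable_blaslt_global_search` is set to 1, this flag specifies the int8 GEMM configuration file produced by offline tuning; the default is "". The file can be produced with `PaddleNLP/csrc/utils/tune_cublaslt_int8_gemm.py`, which automatically searches for the best GEMM configuration that cuBLASLt offers for the current input sizes and records the result. Note that each CUDA version must be tuned separately. Setting this flag improves performance when running A8W8 models with `FLAGS_enable_blaslt_global_search` set to 1.
 
 **GQA optimization**
 
-- `FLAGS_use_xqa_optim`: whether to enable the XQA optimization for GQA. The default is 0 (disabled). GQA models (such as llama3/3.1 and qwen2) perform better with it set to 1.
+- `FLAGS_use_xqa_optim`: whether to enable the XQA optimization for GQA. The default is 0 (disabled). GQA models (such as llama3/3.1 and qwen2) perform better with it set to 1.
 
 **GPU memory optimization**
 
-- `FLAGS_allocator_strategy`: GPU memory management strategy; the default is `auto_growth`. Prefer `naive_best_fit`, and switch to `auto_growth` if GPU memory runs out (OOM).
-
-- `FLAGS_fraction_of_gpu_memory_to_use`: fraction of GPU memory to use; the default is 0.9. Leaving it at 0.9 is sufficient.
+- `FLAGS_fraction_of_gpu_memory_to_use`: fraction of GPU memory to use; the default is 0.9. Leaving it at 0.9 is sufficient.
 
 - `FLAGS_gemm_use_half_precision_compute_type`: whether to compute in half-precision floating point; the default is 0. Leaving it at 0 is sufficient.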
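
The flag list above can be turned into a small helper that assembles the environment before inference starts. This is only a sketch: the names `build_inference_env` and `apply_env` are hypothetical, and only the flag names and values are taken from the list. It also assumes the flags must be in the environment before the framework is imported, which is how environment-driven flags usually behave but is not stated in this diff.

```python
import os


def build_inference_env(a8w8: bool, gqa: bool, tuned_config: str = "") -> dict:
    """Collect the FLAGS_* settings suggested by the list above (hypothetical helper)."""
    env = {
        # Values the document recommends leaving at their defaults.
        "FLAGS_fraction_of_gpu_memory_to_use": "0.9",
        "FLAGS_gemm_use_half_precision_compute_type": "0",
    }
    if a8w8:
        # Global int8 GEMM tuning is described as beneficial for A8W8 models.
        env["FLAGS_enable_blaslt_global_search"] = "1"
        if tuned_config:
            # Only meaningful when the global search flag is 1.
            env["FLAGS_cublaslt_device_best_config"] = tuned_config
    if gqa:
        # XQA optimization for GQA models such as llama3/3.1 and qwen2.
        env["FLAGS_use_xqa_optim"] = "1"
    return env


def apply_env(env: dict) -> None:
    """Export the settings into the current process environment."""
    os.environ.update(env)


if __name__ == "__main__":
    flags = build_inference_env(a8w8=True, gqa=True)
    apply_env(flags)  # assumed to run before the framework is imported
    for name in sorted(flags):
        print(f"{name}={os.environ[name]}")
```

Keeping the offline config conditional on the A8W8 branch mirrors the dependency described above: the config file is only consulted when global search is enabled.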
diff --git a/llm/docs/predict/llama.md b/llm/docs/predict/llama.md
@@ -1,34 +1,43 @@
 # LLaMA
 
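-This document shows how to build and run the [LLaMA](https://llama.meta.com/) family of large models in PaddleNLP.
+This document shows how to build and run the [LLaMA](https://llama.meta.com/) family of large models in PaddleNLP.
 
 ## Model introduction
 
 * The LLaMA series is a family of open and efficient large foundation language models released by Meta AI.
 
-* [Llama 2](https://llama.meta.com/llama2/): in July 2023, Meta released the Llama 2 series in four sizes: 7B, 13B, 34B, and 70B. This release was open-sourced for commercial use, lowering the cost for startups to build ChatGPT-like chatbots.
+* [Llama 2](https://llama.meta.com/llama2/): in July 2023, Meta released the Llama 2 series in four sizes: 7B, 13B, 34B, and 70B. This release was open-sourced for commercial use, lowering the cost for startups to build ChatGPT-like chatbots.
 
-* [Llama 3](https://llama.meta.com/): on April 19, 2024, Meta launched the Llama 3 series in 8B and 70B sizes, with a 400B Llama-3 still in training. This release improved across multiple benchmarks.
+* [Llama 3](https://llama.meta.com/): on April 19, 2024, Meta launched the Llama 3 series in 8B and 70B sizes, with a 400B Llama-3 still in training. This release improved across multiple benchmarks.
 
-* [Llama 3.1](https://llama.meta.com/): on July 23, 2024, Meta released the Llama 3.1 8B, 70B, and 405B models, further improving performance and efficiency.
+* [Llama 3.1](https://llama.meta.com/): on July 23, 2024, Meta released the Llama 3.1 8B, 70B, and 405B models, further improving performance and efficiency.
 
-## Supported models
+## Verified models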
 
-|              Model             | 
-| :----------------------------: |
-|   meta-llama/Llama-2-7b(-chat)   |
-|   meta-llama/Llama-2-13b(-chat)   |
-|   meta-llama/Llama-2-70b(-chat)    |
-|   meta-llama/Meta-Llama-3-8B(-Instruct) |
-|   meta-llama/Meta-Llama-3-70B(-Instruct)    |
-|   meta-llama/Meta-Llama-3.1-8B(-Instruct)     |
-|   meta-llama/Meta-Llama-3.1-70B(-Instruct)     |
-|   meta-llama/Meta-Llama-3.1-405B(-Instruct)     |
+|Model|
+|:-|
+|meta-llama/Llama-2-7b-chat|
+|meta-llama/Llama-2-13b-chat|
+|meta-llama/Llama-2-70b-chat|
+|meta-llama/Meta-Llama-3-8B-Instruct|
+|meta-llama/Meta-Llama-3-70B-Instruct|
+|meta-llama/Meta-Llama-3.1-8B-Instruct|
+|meta-llama/Meta-Llama-3.1-70B-Instruct|
+|meta-llama/Meta-Llama-3.1-405B-Instruct|
+
+## Verified pre-quantized models
+
+|Model|
+|:-|
+|meta-llama/Meta-Llama-3-8B-Instruct-A8W8C8|
+|meta-llama/Meta-Llama-3-8B-Instruct-A8W8-FP8|
+|meta-llama/Meta-Llama-3.1-8B-Instruct-A8W8C8|
+|meta-llama/Meta-Llama-3.1-8B-Instruct-A8W8-FP8|
 
 
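 ## Model inference
 
-The examples use meta-llama/Meta-Llama-3-8B-Instruct on a single GPU and meta-llama/Meta-Llama-3.1-405B-Instruct on multiple GPUs.
+The examples use meta-llama/Meta-Llama-3-8B-Instruct on a single GPU and meta-llama/Meta-Llama-3.1-405B-Instruct on multiple GPUs.
 
 BF16 inference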
 
@@ -57,7 +66,7 @@ python predict/export_model.py --model_name_or_path meta-llama/Meta-Llama-3-8B-I
 python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --block_attn 1 --quant_type weight_only_int8
 ```
 
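-The models required for the quantized inference below must be produced by following the [LLM quantization tutorial](../quantization.md).
+The models required for the quantized inference below must be produced by following the [LLM quantization tutorial](../quantization.md), e.g. checkpoints/llama_ptq_ckpts; alternatively, use one of the provided pre-quantized models, e.g. meta-llama/Meta-Llama-3-8B-Instruct-A8W8C8.
 
 INT8-A8W8 inference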
 
@@ -76,10 +85,10 @@ INT8-A8W8C8 inference
 
 ```shell
 # Dynamic-graph inference
-python predict/predictor.py --model_name_or_path checkpoints/llama_ptq_ckpts --dtype bfloat16 --mode dynamic --inference_model 1 --block_attn 1 --quant_type a8w8 --cachekv_int8_type static
+python predict/predictor.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct-A8W8C8 --dtype bfloat16 --mode dynamic --inference_model 1 --block_attn 1 --quant_type a8w8 --cachekv_int8_type static
 
 # Export the model (dynamic-to-static)
-python predict/export_model.py --model_name_or_path checkpoints/llama_ptq_ckpts --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --block_attn 1 --quant_type a8w8 --cachekv_int8_type static
+python predict/export_model.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct-A8W8C8 --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --block_attn 1 --quant_type a8w8 --cachekv_int8_type static
 
 # Static-graph inference
 python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --block_attn 1 --quant_type a8w8 --cachekv_int8_type static
@@ -88,10 +97,10 @@ python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype
 FP8-A8W8 inference
 ```shell
 # Dynamic-graph inference
-python predict/predictor.py --model_name_or_path checkpoints/llama_ptq_ckpts --dtype bfloat16 --mode dynamic --inference_model 1 --block_attn 1 --quant_type a8w8_fp8
+python predict/predictor.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct-A8W8-FP8 --dtype bfloat16 --mode dynamic --inference_model 1 --block_attn 1 --quant_type a8w8_fp8
 
 # Export the model (dynamic-to-static)
-python predict/export_model.py --model_name_or_path checkpoints/llama_ptq_ckpts --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --block_attn 1 --quant_type a8w8_fp8
+python predict/export_model.py --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct-A8W8-FP8 --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --block_attn 1 --quant_type a8w8_fp8
 
 # Static-graph inference
 python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --block_attn 1 --quant_type a8w8_fp8
@@ -108,13 +117,12 @@ tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-405B-Instru
 generation_config = GenerationConfig.from_pretrained("meta-llama/Meta-Llama-3.1-405B-Instruct")
 ```
 
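-This example uses fake parameters via --use_fake_parameter. To run inference with a correctly quantized model, quantize it yourself by following the [LLM quantization tutorial](../quantization.md).
+This example uses fake parameters via --use_fake_parameter. To run inference with a correctly quantized model, quantize it yourself by following the [LLM quantization tutorial](../quantization.md).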
 
 ```shell
 # Export the model (you can set paddle.set_device("cpu") in predict/export_model.py to export using host memory)
 python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" predict/export_model.py --model_name_or_path meta-llama/Meta-Llama-3.1-405B-Instruct --output_path /path/to/a8w8c8_tp8 --inference_model 1 --block_attn 1 --dtype bfloat16 --quant_type a8w8 --cachekv_int8_type static --use_fake_parameter 1
 
 # Inference
-python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" predict/predictor.py --model_name_or_path /path/to/a8w8c8_tp8 --mode static --inference_model 1 --block_attn 1 --dtype bfloat16 --quant_type a8w8 --cachekv_int8_type static 
+python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" predict/predictor.py --model_name_or_path /path/to/a8w8c8_tp8 --mode static --inference_model 1 --block_attn 1 --dtype bfloat16 --quant_type a8w8 --cachekv_int8_type static
 ```
-
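
The commands in the llama.md changes above follow a regular pattern, which the sketch below captures. It is an illustration, not part of the PR: the helpers `quant_flags_for` and `predictor_command` are hypothetical, and the mapping from model-name suffix to flags is inferred from the example commands (`-A8W8C8` models are run with `--quant_type a8w8 --cachekv_int8_type static`, `-A8W8-FP8` models with `--quant_type a8w8_fp8`). Returning no quantization flags for other names is an assumption.

```python
import shlex


def quant_flags_for(model_name: str) -> list:
    """Infer quantization flags from the pre-quantized model suffix (assumed convention)."""
    if model_name.endswith("-A8W8C8"):
        return ["--quant_type", "a8w8", "--cachekv_int8_type", "static"]
    if model_name.endswith("-A8W8-FP8"):
        return ["--quant_type", "a8w8_fp8"]
    # Assumption: any other name is treated as an unquantized checkpoint.
    return []


def predictor_command(model_name: str, mode: str = "dynamic") -> list:
    """Build the predict/predictor.py argument list used in the examples above."""
    base = [
        "python", "predict/predictor.py",
        "--model_name_or_path", model_name,
        "--dtype", "bfloat16",
        "--mode", mode,
        "--inference_model", "1",
        "--block_attn", "1",
    ]
    return base + quant_flags_for(model_name)


if __name__ == "__main__":
    for name in (
        "meta-llama/Meta-Llama-3-8B-Instruct-A8W8C8",
        "meta-llama/Meta-Llama-3-8B-Instruct-A8W8-FP8",
    ):
        # shlex.join quotes each argument so the line can be pasted into a shell.
        print(shlex.join(predictor_command(name)))
```

Building the command as a list rather than a string keeps each flag and value separate, so it can be passed directly to a process launcher without re-parsing.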
diff --git a/llm/docs/predict/mixtral.md b/llm/docs/predict/mixtral.md
@@ -1,23 +1,23 @@
 # Mixtral
 
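-This document shows how to build and run the [Mixtral](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) model in PaddleNLP.
+This document shows how to build and run the [Mixtral](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) model in PaddleNLP.
 
 ## Model introduction
 
 
-* The [Mistral series](https://arxiv.org/abs/2310.06825) is a foundation model developed by Mistral AI. It uses grouped-query attention and sliding-window attention to improve model quality and inference speed, and is available as 7B Base and Instruct models.
-* The [Mixtral series](https://arxiv.org/abs/2401.04088) is a foundation model from Mistral AI built on a MoE (Mixture of Experts) architecture. It outperforms comparably sized llama models on most benchmarks. MoE combines the strengths of multiple expert models and only needs to activate a few experts at inference time to achieve very good results, which substantially reduces computation compared with conventional large models. The open-source releases currently include Base and Instruct models at two scales, 8x7B and 8x22B.
+* The [Mistral series](https://arxiv.org/abs/2310.06825) is a foundation model developed by Mistral AI. It uses grouped-query attention and sliding-window attention to improve model quality and inference speed, and is available as 7B Base and Instruct models.
+* The [Mixtral series](https://arxiv.org/abs/2401.04088) is a foundation model from Mistral AI built on a MoE (Mixture of Experts) architecture. It outperforms comparably sized llama models on most benchmarks. MoE combines the strengths of multiple expert models and only needs to activate a few experts at inference time to achieve very good results, which substantially reduces computation compared with conventional large models. The open-source releases currently include Base and Instruct models at two scales, 8x7B and 8x22B.
 
-## Supported models
+## Verified models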
 
-|              Model              |
-| :-----------------------------: |
-| mistralai/Mixtral-8x7B-v0.1(-Instruct) |
+|Model|
+|:-|
+|mistralai/Mixtral-8x7B-v0.1-Instruct|
 
 
 ## 模型推理
 
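-The following walks through the overall inference workflow using Mixtral-8x7B-Instruct-v0.1 on two GPUs.
+The following walks through the overall inference workflow using Mixtral-8x7B-Instruct-v0.1 on two GPUs.
 
 BF16 inference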
 
@@ -97,4 +97,4 @@ python -m paddle.distributed.launch \
     --mode "static" \
     --inference_model \
     --block_attn
-```
+```
diff --git a/llm/docs/predict/qwen.md b/llm/docs/predict/qwen.md
@@ -1,30 +1,39 @@
 # Qwen
 
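-This document shows how to build and run the [Qwen](https://huggingface.co/Qwen) family of large models in PaddleNLP.
+This document shows how to build and run the [Qwen](https://huggingface.co/Qwen) family of large models in PaddleNLP.
 
 ## Model introduction
 
-* [Tongyi Qianwen (Qwen)](https://arxiv.org/abs/2205.01068) is a series of large models developed by Alibaba Cloud, in four sizes: Qwen-1.8B, Qwen-7B, Qwen-14B, and Qwen-72B. Qwen is a Transformer-based large language model trained on very large-scale pretraining data. The pretraining data is diverse and broad in coverage, including large amounts of web text, professional books, and code.
+* [Tongyi Qianwen (Qwen)](https://arxiv.org/abs/2205.01068) is a series of large models developed by Alibaba Cloud, in four sizes: Qwen-1.8B, Qwen-7B, Qwen-14B, and Qwen-72B. Qwen is a Transformer-based large language model trained on very large-scale pretraining data. The pretraining data is diverse and broad in coverage, including large amounts of web text, professional books, and code.
 
-* [Tongyi Qianwen (Qwen1.5)](https://qwenlm.github.io/blog/qwen1.5/) is an upgraded version of Alibaba Cloud's Qwen series. Qwen1.5 comprises Base and Chat models at eight sizes: 0.5B, 1.8B, 4B, 7B, 14B, 32B, 72B, and 110B.
+* [Tongyi Qianwen (Qwen1.5)](https://qwenlm.github.io/blog/qwen1.5/) is an upgraded version of Alibaba Cloud's Qwen series. Qwen1.5 comprises Base and Chat models at eight sizes: 0.5B, 1.8B, 4B, 7B, 14B, 32B, 72B, and 110B.
 
-* [Tongyi Qianwen (Qwen2)](https://qwenlm.github.io/blog/qwen2/) is an upgraded version of Alibaba Cloud's Qwen series. Qwen2 comprises Base and Instruct models at five sizes: Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B.
+* [Tongyi Qianwen (Qwen2)](https://qwenlm.github.io/blog/qwen2/) is an upgraded version of Alibaba Cloud's Qwen series. Qwen2 comprises Base and Instruct models at five sizes: Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B, and Qwen2-72B.
 
-* [Tongyi Qianwen (Qwen-MoE)](https://qwenlm.github.io/blog/qwen2/) is an upgraded version of Alibaba Cloud's Qwen series. Qwen-MoE comprises Base, Chat, and Instruct models at two sizes: Qwen1.5-MoE-A2.7B and Qwen2-57B-A14B.
+* [Tongyi Qianwen (Qwen-MoE)](https://qwenlm.github.io/blog/qwen2/) is an upgraded version of Alibaba Cloud's Qwen series. Qwen-MoE comprises Base, Chat, and Instruct models at two sizes: Qwen1.5-MoE-A2.7B and Qwen2-57B-A14B.
 
-## Supported models
+## Verified models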
 
-|              Model             | 
-| :----------------------------: |
-|   Qwen/Qwen2-0.5B(-Instruct)   |
-|   Qwen/Qwen2-1.5B(-Instruct)   |
-|    Qwen/Qwen2-7B(-Instruct)    |
-|  Qwen/Qwen1.5-MoE-A2.7B(-Chat) |
+|Model|
+|:-|
+|Qwen/Qwen2-0.5B-Instruct|
+|Qwen/Qwen2-1.5B-Instruct|
+|Qwen/Qwen2-7B-Instruct|
+|Qwen/Qwen1.5-MoE-A2.7B-Chat|
+|Qwen/Qwen2-57B-A14B-Instruct|
 
+## Verified pre-quantized models
+
+|Model|
+|:-|
+|Qwen/Qwen2-1.5B-Instruct-A8W8C8|
+|Qwen/Qwen2-1.5B-Instruct-A8W8-FP8|
+|Qwen/Qwen2-7B-Instruct-A8W8C8|
+|Qwen/Qwen2-7B-Instruct-A8W8-FP8|
 
 ## Model inference
 
-The examples use Qwen/Qwen2-1.5B-Instruct.
+The examples use Qwen/Qwen2-1.5B-Instruct.
 
 BF16 inference
 
@@ -53,7 +62,7 @@ python predict/export_model.py --model_name_or_path Qwen/Qwen2-1.5B-Instruct --o
 python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --block_attn 1 --quant_type weight_only_int8
 ```
 
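-The models required for the quantized inference below must be produced by following the [LLM quantization tutorial](../quantization.md).
+The models required for the quantized inference below must be produced by following the [LLM quantization tutorial](../quantization.md), e.g. checkpoints/qwen_ptq_ckpts; alternatively, use one of the provided pre-quantized models, e.g. Qwen/Qwen2-1.5B-Instruct-A8W8C8.
 
 INT8-A8W8 inference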
 
@@ -72,10 +81,10 @@ INT8-A8W8C8 inference
 
 ```shell
 # Dynamic-graph inference
-python predict/predictor.py --model_name_or_path checkpoints/qwen_ptq_ckpts --dtype bfloat16 --mode dynamic --inference_model 1 --block_attn 1 --quant_type a8w8 --cachekv_int8_type static
+python predict/predictor.py --model_name_or_path Qwen/Qwen2-1.5B-Instruct-A8W8C8 --dtype bfloat16 --mode dynamic --inference_model 1 --block_attn 1 --quant_type a8w8 --cachekv_int8_type static
 
 # Export the model (dynamic-to-static)
-python predict/export_model.py --model_name_or_path checkpoints/qwen_ptq_ckpts --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --block_attn 1 --quant_type a8w8 --cachekv_int8_type static
+python predict/export_model.py --model_name_or_path Qwen/Qwen2-1.5B-Instruct-A8W8C8 --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --block_attn 1 --quant_type a8w8 --cachekv_int8_type static
 
 # Static-graph inference
 python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --block_attn 1 --quant_type a8w8 --cachekv_int8_type static
@@ -84,10 +93,10 @@ python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype
 FP8-A8W8 inference
 ```shell
 # Dynamic-graph inference
-python predict/predictor.py --model_name_or_path checkpoints/qwen_ptq_ckpts --dtype bfloat16 --mode dynamic --inference_model 1 --block_attn 1 --quant_type a8w8_fp8
+python predict/predictor.py --model_name_or_path Qwen/Qwen2-7B-Instruct-A8W8-FP8 --dtype bfloat16 --mode dynamic --inference_model 1 --block_attn 1 --quant_type a8w8_fp8
 
 # Export the model (dynamic-to-static)
-python predict/export_model.py --model_name_or_path checkpoints/qwen_ptq_ckpts --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --block_attn 1 --quant_type a8w8_fp8
+python predict/export_model.py --model_name_or_path Qwen/Qwen2-7B-Instruct-A8W8-FP8 --output_path /path/to/exported_model --dtype bfloat16 --inference_model 1 --block_attn 1 --quant_type a8w8_fp8
 
 # Static-graph inference
 python predict/predictor.py --model_name_or_path /path/to/exported_model --dtype bfloat16 --mode static --inference_model 1 --block_attn 1 --quant_type a8w8_fp8

diff --git a/llm/predict/predictor.py b/llm/predict/predictor.py
@@ -886,12 +886,8 @@ def init_model_inputs(self, config: PredictorArgument):
                 ]
             )
             # self.model_inputs["src_mask/tgt_mask"] is read only, will not be updated!
-            src_mask = (
-                alibi_encoder + (1 - src_mask) * paddle.finfo(self.dtype).min
-            ).cast(self.dtype)
-            tgt_mask = (
-                alibi_decoder + (1 - tgt_mask) * paddle.finfo(self.dtype).min
-            ).cast(self.dtype)
+            src_mask = (alibi_encoder + (1 - src_mask) * paddle.finfo(self.dtype).min).cast(self.dtype)
+            tgt_mask = (alibi_decoder + (1 - tgt_mask) * paddle.finfo(self.dtype).min).cast(self.dtype)
             self.model_inputs["rope_emb"] = paddle.concat([src_mask.reshape([-1]), tgt_mask.reshape([-1])])
 
     def _preprocess(self, input_text: list[str]):

diff --git a/paddlenlp/experimental/transformers/bloom/modeling.py b/paddlenlp/experimental/transformers/bloom/modeling.py
@@ -293,6 +293,7 @@
 
     @paddle.no_grad()
     def set_state_dict(self, state_dict, use_structured_name=True):
+        self.transformer_block.init_weight()
         for k, v in state_dict.items():
             if k.find("word_embeddings.weight") >= 0:
                 self.word_embeddings.weight.set_value(paddle.to_tensor(v))

diff --git a/paddlenlp/experimental/transformers/chatglm/modeling.py b/paddlenlp/experimental/transformers/chatglm/modeling.py
@@ -377,6 +377,7 @@
 
     @paddle.no_grad()
     def set_state_dict(self, state_dict, use_structured_name=True):
+        self.transformer_block.init_weight()
         dtype = paddle.get_default_dtype()
         config = self.config
         embed_dim = config.hidden_size

diff --git a/paddlenlp/experimental/transformers/chatglm_v2/modeling.py b/paddlenlp/experimental/transformers/chatglm_v2/modeling.py
@@ -290,6 +290,8 @@
 
     @paddle.no_grad()
     def set_state_dict(self, state_dict):
+        self.transformer_block.init_weight()
+
         # find the real name.
         def key(name):
             result_list = []