diff --git a/application/neural_search/README.md b/application/neural_search/README.md new file mode 100644 index 000000000000..539a751ed22e --- /dev/null +++ b/application/neural_search/README.md @@ -0,0 +1,287 @@ +# 手把手搭建一个语义检索系统 + +## 1. 场景概述 + +检索系统存在于我们日常使用的很多产品中,比如商品搜索系统、学术文献检索系等等,本方案提供了检索系统完整实现。限定场景是用户通过输入检索词 Query,快速在海量数据中查找相似文档。 + +所谓语义检索(也称基于向量的检索),是指检索系统不再拘泥于用户 Query 字面本身,而是能精准捕捉到用户 Query 后面的真正意图并以此来搜索,从而更准确地向用户返回最符合的结果。通过使用最先进的语义索引模型找到文本的向量表示,在高维向量空间中对它们进行索引,并度量查询向量与索引文档的相似程度,从而解决了关键词索引带来的缺陷。 + +例如下面两组文本 Pair,如果基于关键词去计算相似度,两组的相似度是相同的。而从实际语义上看,第一组相似度高于第二组。 + +``` +车头如何放置车牌 前牌照怎么装 +车头如何放置车牌 后牌照怎么装 +``` + +语义检索系统的关键就在于,采用语义而非关键词方式进行召回,达到更精准、更广泛得召回相似结果的目的。 + +## 2. 产品功能介绍 + +通常检索业务的数据都比较庞大,都会分为召回(索引)、排序两个环节。召回阶段主要是从至少千万级别的候选集合里面,筛选出相关的文档,这样候选集合的数目就会大大降低,在之后的排序阶段就可以使用一些复杂的模型做精细化或者个性化的排序。一般采用多路召回策略(例如关键词召回、热点召回、语义召回结合等),多路召回结果聚合后,经过统一的打分以后选出最优的 TopK 的结果。 + +### 2.1 系统特色 + ++ 低门槛 + + 手把手搭建起检索系统 + + 无需标注数据也能构建检索系统 + + 提供 训练、预测、ANN 引擎一站式能力 + ++ 效果好 + + 针对多种数据场景的专业方案 + + 仅有无监督数据: SimCSE + + 仅有有监督数据: InBatchNegative + + 兼具无监督数据 和 有监督数据:融合模型 + + 进一步优化方案: 面向领域的预训练 Domain-adaptive Pretraining ++ 性能快 + + 基于 Paddle Inference 快速抽取向量 + + 基于 Milvus 快速查询和高性能建库 + +### 2.2 功能架构 + +索引环节有两类方法:基于字面的关键词索引;语义索引。语义索引能够较好地表征语义信息,解决字面不相似但语义相似的情形。本系统给出的是语义索引方案,实际业务中可融合其他方案使用。下面就详细介绍整个方案的架构和功能。 + +#### 2.2.1 整体介绍 + + + +![系统流程图](./img/system_pipeline.png) +以上是nerual_search的系统流程图,其中左侧为召回环节,核心是语义向量抽取模块;右侧是排序环节,核心是排序模型。图中红色虚线框表示在线计算,黑色虚线框表示离线批量处理。下面我们分别介绍召回中的语义向量抽取模块,以及排序模型。 + + +#### 2.2.2 召回模块 + +召回模块需要从千万量级数据中快速召回候选数据。首先需要抽取语料库中文本的 Embedding,然后借助向量搜索引擎实现高效 ANN,从而实现候选集召回。 + +我们针对不同的数据情况推出三种语义索引方案,如下图所示,您可以参照此方案,快速建立语义索引: + +| ⭐️ 无监督数据 | ⭐️ 有监督数据 | **召回方案** | +| ------------ | ------------ | ------------ | +| 多 | 无 | SimCSE | +| 无 | 多 | In-batch Negatives| +| 有 | 有 | SimCSE+ In-batch Negatives | + +最基本的情况是只有无监督数据,我们推荐您使用 SimCSE 进行无监督训练;另一种方案是只有有监督数据,我们推荐您使用 In-batch Negatives 的方法进行有监督训练。 + +如果想进一步提升模型效果:还可以使用大规模业务数据,对预训练模型进行 Domain-adaptive Pretraining,训练完以后得到预训练模型,再进行无监督的 SimCSE。 + +此外,如果您同时拥有监督数据和无监督数据,我们推荐将两种方案结合使用,这样能训练出更加强大的语义索引模型。 + +#### 2.2.3 排序模块 + +排序模块基于前沿的预训练模型 ERNIE-Gram,训练 Pair-wise 语义匹配模型。召回模型负责从海量(千万级)候选文本中快速(毫秒级)筛选出与 Query 相关性较高的 TopK Doc,排序模型会在召回模型筛选出的 TopK Doc 结果基础之上针对每一个 (Query, Doc) Pair 对进行两两匹配计算相关性,排序效果更精准。 + +## 3. 
文献检索实践 + +### 3.1 技术方案和评估指标 + +#### 3.1.1 技术方案 + +**语义索引**:由于我们既有无监督数据,又有有监督数据,所以结合 SimCSE 和 In-batch Negatives 方案,并采取 Domain-adaptive Pretraining 优化模型效果。 + +首先是利用 ERNIE 1.0 模型进行 Domain-adaptive Pretraining,在得到的预训练模型基础上,进行无监督的 SimCSE 训练,最后利用 In-batch Negatives 方法进行微调,得到最终的语义索引模型,把建库的文本放入模型中抽取特征向量,然后把抽取后的向量放到语义索引引擎 milvus 中,利用 milvus 就可以很方便得实现召回了。 + +**排序**:使用 ERNIE-Gram 的单塔结构对召回后的数据精排序。 + +#### 3.1.2 评估指标 + +**模型效果指标** +* 在语义索引召回阶段使用的指标是 Recall@K,表示的是预测的前topK(从最后的按得分排序的召回列表中返回前K个结果)结果和语料库中真实的前 K 个相关结果的重叠率,衡量的是检索系统的查全率。 + +* 在排序阶段使用的指标为AUC,AUC反映的是分类器对样本的排序能力,如果完全随机得对样本分类,那么AUC应该接近0.5。分类器越可能把真正的正样本排在前面,AUC越大,分类性能越好。 + +**性能指标** +* 基于 Paddle Inference 快速抽取向量 + +* 建库性能和 ANN 查询性能快 + +### 3.2 数据说明 + +数据集来源于某文献检索系统,既有大量无监督数据,又有有监督数据。 + +(1)采用文献的 query, title,keywords,abstract 四个字段内容,构建无标签数据集进行 Domain-adaptive Pretraining; + +(2)采用文献的 query,title,keywords 三个字段内容,构造无标签数据集,进行无监督召回训练SimCSE; + +(3)使用文献的的query, title, keywords,构造带正标签的数据集,不包含负标签样本,基于 In-batch Negatives 策略进行训练; + +(4)在排序阶段,使用点击(作为正样本)和展现未点击(作为负样本)数据构造排序阶段的训练集,使用ERNIE-Gram模型进行精排训练。 + +| 阶段 |模型 | 训练集 | 评估集(用于评估模型效果) | 召回库 |测试集 | +| ------------ | ------------ |------------ | ------------ | ------------ | ------------ | +| 召回 | Domain-adaptive Pretraining | 2kw | - | - | - | +| 召回 | 无监督预训练 - SimCSE | 798w | 20000 | 300000| 1000 | +| 召回 | 有监督训练 - In-batch Negatives | 3998 | 20000 |300000 | 1000 | +| 排序 | 有监督训练 - ERNIE-Gram单塔 Pairwise| 1973538 | 57811 | - | 1000 | + +我们将除 Domain-adaptive Pretraining 之外的其他数据集全部开源,下载地址: + +- [literature_search_data](https://bj.bcebos.com/v1/paddlenlp/data/literature_search_data.zip) + +``` +├── milvus # milvus建库数据集 + ├── milvus_data.csv. # 构建召回库的数据(模拟实际业务线上的语料库,实际语料库远大于这里的规模),用于直观演示相关文献召回效果 +├── recall # 召回阶段数据集 + ├── train_unsupervised.csv # 无监督训练集,用于训练 SimCSE + ├── train.csv # 有监督训练集,用于训练 In-batch Negative + ├── dev.csv # 召回阶段验证集,用于评估召回模型的效果,SimCSE 和 In-batch Negative 共用 + ├── corpus.csv # 构建召回库的数据(模拟实际业务线上的语料库,实际语料库远大于这里的规模),用于评估召回阶段模型效果,SimCSE 和 In-batch Negative 共用 + ├── test.csv # 召回阶段测试数据,预测文本之间的相似度,SimCSE 和 In-batch Negative 共用 +├── sort # 排序阶段数据集 + ├── train_pairwise.csv # 排序训练集 + ├── dev_pairwise.csv # 排序验证集 + └── test_pairwise.csv # 排序测试集 +``` + +### 3.3 运行环境和安装说明 + + +(1)运行环境 + +本实验采用了以下的运行环境进行,详细说明如下,用户也可以在自己 GPU 硬件环境进行: + +a. 软件环境: + + +- python >= 3.6 +- paddlenlp >= 2.2.1 +- paddlepaddle-gpu >=2.2 +- CUDA Version: 10.2 +- NVIDIA Driver Version: 440.64.00 +- Ubuntu 16.04.6 LTS (Docker) + + +b. 硬件环境: + + +- NVIDIA Tesla V100 16GB x4卡 +- Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz + + +c. 依赖安装: + +``` +pip install -r requirements.txt +``` + +## 4. 
动手实践——搭建自己的检索系统 + +这里展示了能够从头至尾跑通的完整代码,您使用自己的业务数据,照着跑,能搭建出一个给定 Query,返回 topK 相关文档的小型检索系统。您可以参照我们给出的效果和性能数据来检查自己的运行过程是否正确。 + +### 4.1 召回阶段 + +**召回模型训练** + +这里采用 Domain-adaptive Pretraining + SimCSE + In-batch Negatives 方案: + +第一步:无监督训练 Domain-adaptive Pretraining + +训练用时 16hour55min,可参考:[ERNIE 1.0](./recall/domain_adaptive_pretraining/) + +第二步:无监督训练 SimCSE + +训练用时 16hour53min,可参考:[SimCSE](./recall/simcse/) + +第三步:有监督训练 + +几分钟内训练完成,可参考 [In-batch Negatives](./recall/in_batch_negative/) + + +此外,我们进行了多组实践,用来对比说明召回阶段各方案的效果: + +| 模型 | Recall@1 | Recall@5 |Recall@10 |Recall@20 |Recall@50 |策略简要说明| +| ------------ | ------------ | ------------ |--------- |--------- |--------- |--------- | +| 有监督训练 Baseline | 30.077| 43.513| 48.633 | 53.448 |59.632| 标准 pair-wise 训练范式,通过随机采样产生负样本| +| 有监督训练 In-batch Negatives | 51.301 | 65.309| 69.878| 73.996|78.881| In-batch Negatives 有监督训练| +| 无监督训练 SimCSE | 42.374 | 57.505| 62.641| 67.09|72.331| SimCSE 无监督训练| +| 无监督 + 有监督训练 SimCSE + In-batch Negatives | 55.976 | 71.849| 76.363| 80.49|84.809| SimCSE无监督训练,In-batch Negatives 有监督训练| +| Domain-adaptive Pretraining + SimCSE | 51.031 | 66.648| 71.338 | 75.676 |80.144| ERNIE 预训练,SimCSE 无监督训练| +| Domain-adaptive Pretraining + SimCSE + In-batch Negatives| **58.248** | **75.099**| **79.813**| **83.801**|**87.733**| ERNIE 预训练,SimCSE 无监督训训练,In-batch Negatives 有监督训练| + +从上述表格可以看出,首先利用ERNIE 1.0 做 Domain-adaptive Pretraining ,然后把训练好的模型加载到 SimCSE 上进行无监督训练,最后利用 In-batch Negatives 在有监督数据上进行训练能够获得最佳的性能。 + +**召回系统搭建** + +召回系统使用索引引擎 Milvus,可参考 [milvus_system](./recall/milvus/)。 +我们展示一下系统的效果,输入的文本如下: + +``` +中西方语言与文化的差异 + +``` +下面是召回的部分结果,第一个是召回的title,第二个数字是计算的相似度距离 + +``` +跨文化中的文化习俗对翻译的影响翻译,跨文化,文化习俗 0.615584135055542 +试论翻译过程中的文化差异与语言空缺翻译过程,文化差异,语言空缺,文化对比 0.6155391931533813 +中英文化差异及习语翻译习语,文化差异,翻译 0.6153547763824463 +英语中的中国文化元素英语,中国文化,语言交流 0.6151996850967407 +跨文化交际中的文化误读研究文化误读,影响,中华文化,西方文明 0.6137217283248901 +在语言学习中了解中法文化差异文化差异,对话交际,语言 0.6134252548217773 +从翻译视角看文化差异影响下的中式英语的应对策略文化差异;中式英语现;汉英翻译;动态对等理论 0.6127341389656067 +归化与异化在跨文化传播中的动态平衡归化,异化,翻译策略,跨文化传播,文化外译 0.6127211451530457 +浅谈中西言语交际行为中的文化差异交际用语,文化差异,中国,西方 0.6125463843345642 +翻译中的文化因素--异化与归化文化翻译,文化因素,异化与归化 0.6111845970153809 +历史与文化差异对翻译影响的分析研究历史与文化差异,法汉翻译,翻译方法 0.6107486486434937 +从中、韩、美看跨文化交际中的东西方文化差异跨文化交际,东西方,文化差异 0.6091923713684082 +试论文化差异对翻译工作的影响文化差异,翻译工作,影响 0.6084284782409668 +从归化与异化看翻译中的文化冲突现象翻译,文化冲突,归化与异化,跨文化交际 0.6063553690910339 +中西方问候语的文化差异问候语,文化差异,文化背景 0.6054259538650513 +中英思维方式的差异对翻译的影响中英文化的差异,中英思维方式的差异,翻译 0.6026732921600342 +略论中西方语言文字的特性与差异语言,会意,确意,特性,差异 0.6009351015090942 +...... 
+ +``` + + +### 4.2 排序阶段 + +排序阶段使用的模型是 ERNIE-Gram,用时 20h,可参考: + +[ernie_matching](./ranking/ernie_matching/) + +排序阶段的效果评估: + +| 模型 | AUC | +| ------------ | ------------ | +| Baseline: In-batch Negatives | 0.582 | +| ERNIE-Gram | 0.801 | + +同样输入文本: + +``` +中西方语言与文化的差异 +``` +排序阶段的结果展示如下,第一个是 Title ,第二个数字是计算的概率,显然经排序阶段筛选的文档与 Query 更相关: + +``` +中西方文化差异以及语言体现中西方文化,差异,语言体现 0.999848484992981 +论中西方语言与文化差异的历史渊源中西方语言,中西方文化,差异,历史渊源 0.9998375177383423 +从日常生活比较中西方语言与文化的差异中西方,语言,文化,比较 0.9985846281051636 +试论中西方语言文化教育的差异比较与融合中西方,语言文化教育,差异 0.9972485899925232 +中西方文化差异对英语学习的影响中西方文化,差异,英语,学习 0.9831035137176514 +跨文化视域下的中西文化差异研究跨文化,中西,文化差异 0.9781349897384644 +中西方文化差异对跨文化交际的影响分析文化差异,跨文化交际,影响 0.9735479354858398 +探析跨文化交际中的中西方语言差异跨文化交际,中西方,语言差异 0.9668175578117371 +中西方文化差异解读中英文差异表达中西文化,差异表达,跨文化交际 0.9629314541816711 +中西方文化差异对英语翻译的影响中西方文化差异,英语翻译,翻译策略,影响 0.9538986086845398 +论跨文化交际中的中西方文化冲突跨文化交际,冲突,文化差异,交际策略,全球化 0.9493677616119385 +中西方文化差异对英汉翻译的影响中西方文化,文化差异,英汉翻译,影响 0.9430705904960632 +中西方文化差异与翻译中西方,文化差异,翻译影响,策略方法,译者素质 0.9401137828826904 +外语教学中的中西文化差异外语教学,文化,差异 0.9397934675216675 +浅析西语国家和中国的文化差异-以西班牙为例跨文化交际,西语国家,文化差异 0.9373322129249573 +中英文化差异在语言应用中的体现中英文化,汉语言,语言应用,语言差异 0.9359155297279358 +.... +``` + + +## Reference + +[1] Tianyu Gao, Xingcheng Yao, Danqi Chen: [SimCSE: Simple Contrastive Learning of Sentence Embeddings](https://arxiv.org/abs/2104.08821). EMNLP (1) 2021: 6894-6910 + +[2] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih, [Dense Passage Retrieval for Open-Domain Question Answering](https://arxiv.org/abs/2004.04906). Preprint 2020. + +[3] Dongling Xiao, Yu-Kun Li, Han Zhang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang: +[ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding](https://arxiv.org/abs/2010.12148). NAACL-HLT 2021: 1702-1715 + +[4] Yu Sun, Shuohuan Wang, Yu-Kun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, Hua Wu: +[ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/abs/1904.09223). CoRR abs/1904.09223 (2019) diff --git a/application/neural_search/img/mem.png b/application/neural_search/img/mem.png new file mode 100644 index 000000000000..93d770e9f04c Binary files /dev/null and b/application/neural_search/img/mem.png differ diff --git a/application/neural_search/img/system_pipeline.png b/application/neural_search/img/system_pipeline.png new file mode 100644 index 000000000000..b7ef7972df94 Binary files /dev/null and b/application/neural_search/img/system_pipeline.png differ diff --git a/application/neural_search/ranking/ernie_matching/README.md b/application/neural_search/ranking/ernie_matching/README.md new file mode 100644 index 000000000000..96698c65a94f --- /dev/null +++ b/application/neural_search/ranking/ernie_matching/README.md @@ -0,0 +1,282 @@ + + **目录** + +* [背景介绍](#背景介绍) +* [ERNIE-Gram](#ERNIE-Gram) + * [1. 技术方案和评估指标](#技术方案) + * [2. 环境依赖](#环境依赖) + * [3. 代码结构](#代码结构) + * [4. 数据准备](#数据准备) + * [5. 模型训练](#模型训练) + * [6. 评估](#开始评估) + * [7. 预测](#预测) + * [8. 部署](#部署) + + + +# 背景介绍 + +基于ERNIE-Gram训练Pair-wise模型。Pair-wise 匹配模型适合将文本对相似度作为特征之一输入到上层排序模块进行排序的应用场景。 + + + + +# ERNIE-Gram + + + +## 1. 技术方案和评估指标 + +### 技术方案 + +双塔模型,使用ERNIE-Gram预训练模型,使用margin_ranking_loss训练模型。 + + +### 评估指标 + +(1)采用 AUC 指标来评估排序模型的排序效果。 + +**效果评估** + +| 模型 | AUC | +| ------------ | ------------ | +| ERNIE-Gram | 0.801 | + + + +## 2. 
环境依赖和安装说明 + +**环境依赖** + +* python >= 3.x +* paddlepaddle >= 2.1.3 +* paddlenlp >= 2.2 +* pandas >= 0.25.1 +* scipy >= 1.3.1 + + + +## 3. 代码结构 + +以下是本项目主要代码结构及说明: + +``` +ernie_matching/ +├── deply # 部署 + └── python + ├── deploy.sh # 预测部署bash脚本 + └── predict.py # python 预测部署示例 +|—— scripts + ├── export_model.sh # 动态图参数导出静态图参数的bash文件 + ├── train_pairwise.sh # Pair-wise 单塔匹配模型训练的bash文件 + ├── evaluate.sh # 评估验证文件bash脚本 + ├── predict_pairwise.sh # Pair-wise 单塔匹配模型预测脚本的bash文件 +├── export_model.py # 动态图参数导出静态图参数脚本 +├── model.py # Pair-wise 匹配模型组网 +├── data.py # Pair-wise 训练样本的转换逻辑 、Pair-wise 生成随机负例的逻辑 +├── train_pairwise.py # Pair-wise 单塔匹配模型训练脚本 +├── evaluate.py # 评估验证文件 +├── predict_pairwise.py # Pair-wise 单塔匹配模型预测脚本,输出文本对是相似度 + +``` + + + +## 4. 数据准备 + +### 数据集说明 + +样例数据如下: +``` +个人所得税税务筹划 基于新个税视角下的个人所得税纳税筹划分析新个税;个人所得税;纳税筹划 个人所得税工资薪金税务筹划研究个人所得税,工资薪金,税务筹划 +液压支架底座受力分析 ZY4000/09/19D型液压支架的有限元分析液压支架,有限元分析,两端加载,偏载,扭转 基于ANSYS的液压支架多工况受力分析液压支架,四种工况,仿真分析,ANSYS,应力集中,优化 +迟发性血管痉挛 西洛他唑治疗动脉瘤性蛛网膜下腔出血后脑血管痉挛的Meta分析西洛他唑,蛛网膜下腔出血,脑血管痉挛,Meta分析 西洛他唑治疗动脉瘤性蛛网膜下腔出血后脑血管痉挛的Meta分析西洛他唑,蛛网膜下腔出血,脑血管痉挛,Meta分析 +氧化亚硅 复合溶胶-凝胶一锅法制备锂离子电池氧化亚硅/碳复合负极材料氧化亚硅,溶胶-凝胶法,纳米颗粒,负极,锂离子电池 负载型聚酰亚胺-二氧化硅-银杂化膜的制备和表征聚酰亚胺,二氧化硅,银,杂化膜,促进传输 +``` + + +### 数据集下载 + + +- [literature_search_data](https://bj.bcebos.com/v1/paddlenlp/data/literature_search_data.zip) + +``` +├── milvus # milvus建库数据集 + ├── milvus_data.csv. # 构建召回库的数据 +├── recall # 召回(语义索引)数据集 + ├── corpus.csv # 用于测试的召回库 + ├── dev.csv # 召回验证集 + ├── test.csv # 召回测试集 + ├── train.csv # 召回训练集 + ├── train_unsupervised.csv # 无监督训练集 +├── sort # 排序数据集 + ├── test_pairwise.csv # 排序测试集 + ├── dev_pairwise.csv # 排序验证集 + └── train_pairwise.csv # 排序训练集 + +``` + + + +## 5. 模型训练 + +**排序模型下载链接:** + + +|Model|训练参数配置|硬件|MD5| +| ------------ | ------------ | ------------ |-----------| +|[ERNIE-Gram-Sort](https://bj.bcebos.com/v1/paddlenlp/models/ernie_gram_sort.zip)|
epoch:3 lr:5E-5 bs:64 max_len:64 | 4卡 v100-16g
|d24ece68b7c3626ce6a24baa58dd297d| + + +### 训练环境说明 + + +- NVIDIA Driver Version: 440.64.00 +- Ubuntu 16.04.6 LTS (Docker) +- Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz + + +### 单机单卡训练/单机多卡训练 + +这里采用单机多卡方式进行训练,通过如下命令,指定 GPU 0,1,2,3 卡, 基于ERNIE-Gram训练模型,数据量比较大,需要20小时10分钟左右。如果采用单机单卡训练,只需要把`--gpu`参数设置成单卡的卡号即可 + +训练的命令如下: + +``` +python -u -m paddle.distributed.launch --gpus "0,2,3,4" train_pairwise.py \ + --device gpu \ + --save_dir ./checkpoints \ + --batch_size 32 \ + --learning_rate 2E-5 \ + --margin 0.1 \ + --eval_step 100 \ + --train_file data/train_pairwise.csv \ + --test_file data/dev_pairwise.csv +``` +也可以运行bash脚本: + +``` +sh scripts/train_pairwise.sh +``` + + + +## 6. 评估 + + +``` +unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus "0" evaluate.py \ + --device gpu \ + --batch_size 32 \ + --learning_rate 2E-5 \ + --init_from_ckpt "./checkpoints/model_30000/model_state.pdparams" \ + --test_file data/dev_pairwise.csv +``` +也可以运行bash脚本: + +``` +sh scripts/evaluate.sh +``` + + +成功运行后会输出下面的指标: + +``` +eval_dev auc:0.796 +``` + + + +## 7. 预测 + +### 准备预测数据 + +待预测数据为 tab 分隔的 tsv 文件,每一行为 1 个文本 Pair,和文本pair的语义索引相似度,部分示例如下: + +``` +中西方语言与文化的差异 第二语言习得的一大障碍就是文化差异。 0.5160342454910278 +中西方语言与文化的差异 跨文化视角下中国文化对外传播路径琐谈跨文化,中国文化,传播,翻译 0.5145505666732788 +中西方语言与文化的差异 从中西方民族文化心理的差异看英汉翻译语言,文化,民族文化心理,思维方式,翻译 0.5141439437866211 +中西方语言与文化的差异 中英文化差异对翻译的影响中英文化,差异,翻译的影响 0.5138794183731079 +中西方语言与文化的差异 浅谈文化与语言习得文化,语言,文化与语言的关系,文化与语言习得意识,跨文化交际 0.5131710171699524 +``` + + + +### 开始预测 + +以上述 demo 数据为例,运行如下命令基于我们开源的 ERNIE-Gram模型开始计算文本 Pair 的语义相似度: + +```shell +python -u -m paddle.distributed.launch --gpus "0" \ + predict_pairwise.py \ + --device gpu \ + --params_path "./checkpoints/model_30000/model_state.pdparams"\ + --batch_size 128 \ + --max_seq_length 64 \ + --input_file 'sort/test_pairwise.csv' +``` +也可以直接执行下面的命令: + +``` +sh scripts/predict_pairwise.sh +``` +得到下面的输出,分别是query,title和对应的预测概率: + +``` +{'query': '中西方语言与文化的差异', 'title': '第二语言习得的一大障碍就是文化差异。', 'pred_prob': 0.85112214} +{'query': '中西方语言与文化的差异', 'title': '跨文化视角下中国文化对外传播路径琐谈跨文化,中国文化,传播,翻译', 'pred_prob': 0.78629625} +{'query': '中西方语言与文化的差异', 'title': '从中西方民族文化心理的差异看英汉翻译语言,文化,民族文化心理,思维方式,翻译', 'pred_prob': 0.91767526} +{'query': '中西方语言与文化的差异', 'title': '中英文化差异对翻译的影响中英文化,差异,翻译的影响', 'pred_prob': 0.8601749} +{'query': '中西方语言与文化的差异', 'title': '浅谈文化与语言习得文化,语言,文化与语言的关系,文化与语言习得意识,跨文化交际', 'pred_prob': 0.8944413} +``` + + + +## 8. 部署 + +### 动转静导出 + +首先把动态图模型转换为静态图: + +``` +python export_model.py --params_path checkpoints/model_30000/model_state.pdparams --output_path=./output +``` +也可以运行下面的bash脚本: + +``` +sh scripts/export_model.sh +``` + +### Paddle Inference + +修改预测文件路径: + +``` +input_file='../../sort/test_pairwise.csv' +``` + +然后使用PaddleInference + +``` +python predict.py --model_dir=../../output +``` +也可以运行下面的bash脚本: + +``` +sh deploy.sh +``` +得到下面的输出,输出的是样本的query,title以及对应的概率: + +``` +Data: {'query': '中西方语言与文化的差异', 'title': '第二语言习得的一大障碍就是文化差异。'} prob: [0.8511221] +Data: {'query': '中西方语言与文化的差异', 'title': '跨文化视角下中国文化对外传播路径琐谈跨文化,中国文化,传播,翻译'} prob: [0.7862964] +Data: {'query': '中西方语言与文化的差异', 'title': '从中西方民族文化心理的差异看英汉翻译语言,文化,民族文化心理,思维方式,翻译'} prob: [0.91767514] +Data: {'query': '中西方语言与文化的差异', 'title': '中英文化差异对翻译的影响中英文化,差异,翻译的影响'} prob: [0.8601747] +Data: {'query': '中西方语言与文化的差异', 'title': '浅谈文化与语言习得文化,语言,文化与语言的关系,文化与语言习得意识,跨文化交际'} prob: [0.8944413] +``` + +## Reference + +[1] Xiao, Dongling, Yu-Kun Li, Han Zhang, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. 
“ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding.” ArXiv:2010.12148 [Cs]. diff --git a/application/neural_search/ranking/ernie_matching/data.py b/application/neural_search/ranking/ernie_matching/data.py new file mode 100644 index 000000000000..432c4e99154d --- /dev/null +++ b/application/neural_search/ranking/ernie_matching/data.py @@ -0,0 +1,152 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import paddle +import numpy as np + +from paddlenlp.datasets import MapDataset + + +def create_dataloader(dataset, + mode='train', + batch_size=1, + batchify_fn=None, + trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == 'train' else False + if mode == 'train': + batch_sampler = paddle.io.DistributedBatchSampler( + dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler( + dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader( + dataset=dataset, + batch_sampler=batch_sampler, + collate_fn=batchify_fn, + return_list=True) + + +def read_text_pair(data_path): + """Reads data.""" + with open(data_path, 'r', encoding='utf-8') as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 3: + continue + yield {'query': data[0], 'title': data[1]} + + +def convert_pointwise_example(example, + tokenizer, + max_seq_length=512, + is_test=False): + + query, title = example["query"], example["title"] + + encoded_inputs = tokenizer( + text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + if not is_test: + label = np.array([example["label"]], dtype="int64") + return input_ids, token_type_ids, label + else: + return input_ids, token_type_ids + + +def convert_pairwise_example(example, + tokenizer, + max_seq_length=512, + phase="train"): + + if phase == "train": + query, pos_title, neg_title = example["query"], example[ + "title"], example["neg_title"] + + pos_inputs = tokenizer( + text=query, text_pair=pos_title, max_seq_len=max_seq_length) + neg_inputs = tokenizer( + text=query, text_pair=neg_title, max_seq_len=max_seq_length) + + pos_input_ids = pos_inputs["input_ids"] + pos_token_type_ids = pos_inputs["token_type_ids"] + neg_input_ids = neg_inputs["input_ids"] + neg_token_type_ids = neg_inputs["token_type_ids"] + + return (pos_input_ids, pos_token_type_ids, neg_input_ids, + neg_token_type_ids) + + else: + query, title = example["query"], example["title"] + + inputs = tokenizer( + text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = inputs["input_ids"] + token_type_ids = inputs["token_type_ids"] + if phase == "eval": + return input_ids, token_type_ids, example["label"] + elif phase == "predict": + return input_ids, token_type_ids + else: + raise ValueError("not supported phase:{}".format(phase)) + + +def gen_pair(dataset, pool_size=100): + """ + 
Generate triplet randomly based on dataset + + Args: + dataset: A `MapDataset` or `IterDataset` or a tuple of those. + Each example is composed of 2 texts: exampe["query"], example["title"] + pool_size: the number of example to sample negative example randomly + + Return: + dataset: A `MapDataset` or `IterDataset` or a tuple of those. + Each example is composed of 2 texts: exampe["query"], example["pos_title"]、example["neg_title"] + """ + + if len(dataset) < pool_size: + pool_size = len(dataset) + + new_examples = [] + pool = [] + tmp_exmaples = [] + + for example in dataset: + label = example["label"] + + # Filter negative example + if label == 0: + continue + + tmp_exmaples.append(example) + pool.append(example["title"]) + + if len(pool) >= pool_size: + np.random.shuffle(pool) + for idx, example in enumerate(tmp_exmaples): + example["neg_title"] = pool[idx] + new_examples.append(example) + tmp_exmaples = [] + pool = [] + else: + continue + return MapDataset(new_examples) diff --git a/application/neural_search/ranking/ernie_matching/deploy/python/deploy.sh b/application/neural_search/ranking/ernie_matching/deploy/python/deploy.sh new file mode 100644 index 000000000000..56480e4dd5fa --- /dev/null +++ b/application/neural_search/ranking/ernie_matching/deploy/python/deploy.sh @@ -0,0 +1 @@ +python deploy/python/predict.py --model_dir ./output \ No newline at end of file diff --git a/application/neural_search/ranking/ernie_matching/deploy/python/predict.py b/application/neural_search/ranking/ernie_matching/deploy/python/predict.py new file mode 100644 index 000000000000..0bb3a8ed4758 --- /dev/null +++ b/application/neural_search/ranking/ernie_matching/deploy/python/predict.py @@ -0,0 +1,242 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os + +import numpy as np +import pandas as pd +import paddle +import paddlenlp as ppnlp +from scipy.special import softmax +from scipy.special import expit +from paddle import inference +from paddlenlp.data import Stack, Tuple, Pad +from paddlenlp.datasets import load_dataset +from paddlenlp.utils.log import logger +import paddle.nn.functional as F +import sys + +sys.path.append('.') + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, + help="The directory to static model.") + +parser.add_argument("--max_seq_length", default=128, type=int, + help="The maximum total input sequence length after tokenization. 
Sequences " + "longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, + help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", + help="Select which device to train model, defaults to gpu.") + +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], + help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], + help='The tensorrt precision.') + +parser.add_argument('--cpu_threads', default=10, type=int, + help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], + help='Enable to use mkldnn to speed up when using cpu.') + +parser.add_argument("--benchmark", type=eval, default=False, + help="To log some information about environment and running.") +parser.add_argument("--save_log_path", type=str, default="./log_output/", + help="The file path to save log.") +args = parser.parse_args() +# yapf: enable + +def read_text_pair(data_path): + """Reads data.""" + with open(data_path, 'r', encoding='utf-8') as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 3: + continue + yield {'query': data[0], 'title': data[1]} + + +def convert_example(example, tokenizer, max_seq_length=512, is_test=False): + + query, title = example["query"], example["title"] + + encoded_inputs = tokenizer( + text=query, text_pair=title, max_seq_len=max_seq_length) + + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + + if not is_test: + label = np.array([example["label"]], dtype="int64") + return input_ids, token_type_ids, label + else: + return input_ids, token_type_ids + + +class Predictor(object): + def __init__(self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.get_pooled_embedding.pdmodel" + params_file = model_dir + "/inference.get_pooled_embedding.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as intialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8 + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, + min_subgraph_size=30, + precision_mode=precision_mode) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = 
paddle.inference.create_predictor(config) + self.input_handles = [ + self.predictor.get_input_handle(name) + for name in self.predictor.get_input_names() + ] + self.output_handle = self.predictor.get_output_handle( + self.predictor.get_output_names()[0]) + + if args.benchmark: + import auto_log + pid = os.getpid() + self.autolog = auto_log.AutoLogger( + model_name="ernie-tiny", + model_precision=precision, + batch_size=self.batch_size, + data_shape="dynamic", + save_path=args.save_log_path, + inference_config=config, + pids=pid, + process_name=None, + gpu_ids=0, + time_keys=[ + 'preprocess_time', 'inference_time', 'postprocess_time' + ], + warmup=0, + logger=logger) + + def predict(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + label_map(obj:`dict`): The label id (key) to label str (value) map. + + Returns: + results(obj:`dict`): All the predictions labels. + """ + if args.benchmark: + self.autolog.times.start() + + examples = [] + for text in data: + input_ids, segment_ids = convert_example( + text, + tokenizer, + max_seq_length=self.max_seq_length, + is_test=True) + examples.append((input_ids, segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id), # segment + ): fn(samples) + + if args.benchmark: + self.autolog.times.stamp() + + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + sim_score = self.output_handle.copy_to_cpu() + if args.benchmark: + self.autolog.times.stamp() + sim_score = expit(sim_score) + + if args.benchmark: + self.autolog.times.end(stamp=True) + + return sim_score + +if __name__ == "__main__": + # Define predictor to do prediction. + predictor = Predictor(args.model_dir, args.device, args.max_seq_length, + args.batch_size, args.use_tensorrt, args.precision, + args.cpu_threads, args.enable_mkldnn) + + tokenizer = ppnlp.transformers.ErnieGramTokenizer.from_pretrained( + 'ernie-gram-zh') + + # test_ds = load_dataset("lcqmc", splits=["test"]) + input_file='sort/test_pairwise.csv' + + test_ds = load_dataset(read_text_pair,data_path=input_file, lazy=False) + + data = [{'query': d['query'], 'title': d['title']} for d in test_ds] + + batches = [ + data[idx:idx + args.batch_size] + for idx in range(0, len(data), args.batch_size) + ] + + results = [] + for batch_data in batches: + results.extend(predictor.predict(batch_data, tokenizer)) + for idx, text in enumerate(data): + print('Data: {} \t prob: {}'.format(text, results[idx])) + if args.benchmark: + predictor.autolog.report() diff --git a/application/neural_search/ranking/ernie_matching/evaluate.py b/application/neural_search/ranking/ernie_matching/evaluate.py new file mode 100644 index 000000000000..467a86d27c50 --- /dev/null +++ b/application/neural_search/ranking/ernie_matching/evaluate.py @@ -0,0 +1,161 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from functools import partial +import argparse +import os +import random +import time + +import numpy as np +import paddle +import paddle.nn.functional as F + +import paddlenlp as ppnlp +from paddlenlp.data import Stack, Tuple, Pad +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import LinearDecayWithWarmup + +from data import create_dataloader, gen_pair +from data import convert_pairwise_example as convert_example +from model import PairwiseMatching +import pandas as pd +from tqdm import tqdm + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--margin", default=0.1, type=float, help="Margin for pos_score and neg_score.") +parser.add_argument("--test_file", type=str, required=True, help="The full path of test file") + +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=3, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--eval_step", default=200, type=int, help="Step interval for evaluation.") +parser.add_argument('--save_step', default=10000, type=int, help="Step interval for saving checkpoint.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proption over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, phase="dev"): + """ + Given a dataset, it evals model and computes the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + metric(obj:`paddle.metric.Metric`): The evaluation metric. 
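+        phase(obj:`str`): The split name, used only as the prefix of the printed AUC log, e.g. "dev".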
+ """ + model.eval() + metric.reset() + + for idx, batch in enumerate(data_loader): + input_ids, token_type_ids, labels = batch + + pos_probs = model.predict( + input_ids=input_ids, token_type_ids=token_type_ids) + + neg_probs = 1.0 - pos_probs + + preds = np.concatenate((neg_probs, pos_probs), axis=1) + metric.update(preds=preds, labels=labels) + + print("eval_{} auc:{:.3}".format(phase, metric.accumulate())) + metric.reset() + model.train() + +# 构建读取函数,读取原始数据 +def read(src_path, is_predict=False): + data=pd.read_csv(src_path,sep='\t') + for index, row in tqdm(data.iterrows()): + query=row['query'] + title=row['title'] + neg_title=row['neg_title'] + yield {'query':query, 'title':title,'neg_title':neg_title} + +def read_test(src_path, is_predict=False): + data=pd.read_csv(src_path,sep='\t') + for index, row in tqdm(data.iterrows()): + query=row['query'] + title=row['title'] + label=row['label'] + yield {'query':query, 'title':title,'label':label} + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + + dev_ds=load_dataset(read_test,src_path=args.test_file,lazy=False) + print(dev_ds[0]) + + pretrained_model = ppnlp.transformers.ErnieGramModel.from_pretrained( + 'ernie-gram-zh') + tokenizer = ppnlp.transformers.ErnieGramTokenizer.from_pretrained( + 'ernie-gram-zh') + + + trans_func_eval = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + phase="eval") + + + batchify_fn_eval = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # pair_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # pair_segment + Stack(dtype="int64") # label + ): [data for data in fn(samples)] + + + dev_data_loader = create_dataloader( + dev_ds, + mode='dev', + batch_size=args.batch_size, + batchify_fn=batchify_fn_eval, + trans_fn=trans_func_eval) + + model = PairwiseMatching(pretrained_model, margin=args.margin) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + metric = paddle.metric.Auc() + evaluate(model, metric, dev_data_loader, "dev") + + + +if __name__ == "__main__": + do_train() diff --git a/application/neural_search/ranking/ernie_matching/export_model.py b/application/neural_search/ranking/ernie_matching/export_model.py new file mode 100644 index 000000000000..f500a2e73968 --- /dev/null +++ b/application/neural_search/ranking/ernie_matching/export_model.py @@ -0,0 +1,63 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
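+# Exports the trained dynamic-graph PairwiseMatching model to a static-graph inference model via paddle.jit.to_static and saves it under --output_path.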
+ +import argparse +import os +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F +import paddlenlp as ppnlp +from paddlenlp.data import Stack, Tuple, Pad + +from model import PairwiseMatching + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, default='./checkpoint/model_900/model_state.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./output', help="The path of model parameter in static graph to be saved.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + # If you want to use ernie1.0 model, plesace uncomment the following code + # tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained('ernie-1.0') + # pretrained_model = ppnlp.transformers.ErnieModel.from_pretrained("ernie-1.0") + + pretrained_model = ppnlp.transformers.ErnieGramModel.from_pretrained( + 'ernie-gram-zh') + tokenizer = ppnlp.transformers.ErnieGramTokenizer.from_pretrained( + 'ernie-gram-zh') + model = PairwiseMatching(pretrained_model) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + + model.eval() + + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec( + shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec( + shape=[None, None], dtype="int64") # segment_ids + ]) + # Save in static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) diff --git a/application/neural_search/ranking/ernie_matching/model.py b/application/neural_search/ranking/ernie_matching/model.py new file mode 100644 index 000000000000..0b10a145c35f --- /dev/null +++ b/application/neural_search/ranking/ernie_matching/model.py @@ -0,0 +1,89 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
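+# PairwiseMatching couples an ERNIE-Gram encoder with a linear similarity head: training computes a margin ranking loss over (positive, negative) pairs, while predict() returns a sigmoid similarity score for a single pair.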
+ +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class PairwiseMatching(nn.Layer): + def __init__(self, pretrained_model, dropout=None, margin=0.1): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + self.margin = margin + + # hidden_size -> 1, calculate similarity + self.similarity = nn.Linear(self.ptm.config["hidden_size"], 1) + + @paddle.jit.to_static(input_spec=[paddle.static.InputSpec(shape=[None, None], dtype='int64'),paddle.static.InputSpec(shape=[None, None], dtype='int64')]) + def get_pooled_embedding(self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, + position_ids, attention_mask) + cls_embedding = self.dropout(cls_embedding) + sim = self.similarity(cls_embedding) + return sim + + + def predict(self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None): + + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, + attention_mask) + + cls_embedding = self.dropout(cls_embedding) + sim_score = self.similarity(cls_embedding) + sim_score = F.sigmoid(sim_score) + + return sim_score + + def forward(self, + pos_input_ids, + neg_input_ids, + pos_token_type_ids=None, + neg_token_type_ids=None, + pos_position_ids=None, + neg_position_ids=None, + pos_attention_mask=None, + neg_attention_mask=None): + + _, pos_cls_embedding = self.ptm(pos_input_ids, pos_token_type_ids, + pos_position_ids, pos_attention_mask) + + _, neg_cls_embedding = self.ptm(neg_input_ids, neg_token_type_ids, + neg_position_ids, neg_attention_mask) + + pos_embedding = self.dropout(pos_cls_embedding) + neg_embedding = self.dropout(neg_cls_embedding) + + pos_sim = self.similarity(pos_embedding) + neg_sim = self.similarity(neg_embedding) + + pos_sim = F.sigmoid(pos_sim) + neg_sim = F.sigmoid(neg_sim) + + labels = paddle.full( + shape=[pos_cls_embedding.shape[0]], fill_value=1.0, dtype='float32') + + loss = F.margin_ranking_loss( + pos_sim, neg_sim, labels, margin=self.margin) + + return loss diff --git a/application/neural_search/ranking/ernie_matching/predict_pairwise.py b/application/neural_search/ranking/ernie_matching/predict_pairwise.py new file mode 100644 index 000000000000..7bcaece7a3a7 --- /dev/null +++ b/application/neural_search/ranking/ernie_matching/predict_pairwise.py @@ -0,0 +1,129 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
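+# Loads a trained PairwiseMatching checkpoint and prints a similarity probability for every (query, title) pair read from --input_file.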
+ +from functools import partial +import argparse +import sys +import os +import random +import time + +import numpy as np +import paddle +import paddle.nn.functional as F +import paddlenlp as ppnlp +from paddlenlp.datasets import load_dataset +from paddlenlp.data import Stack, Tuple, Pad + +from data import create_dataloader, read_text_pair +from data import convert_pairwise_example as convert_example +from model import PairwiseMatching + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--input_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + + +def predict(model, data_loader): + """ + Predicts the data labels. + + Args: + model (obj:`SemanticIndexBase`): A model to extract text embedding or calculate similarity of text pair. + data_loaer (obj:`List(Example)`): The processed data ids of text pair: [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. + """ + batch_probs = [] + + model.eval() + + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + + batch_prob = model.predict( + input_ids=input_ids, token_type_ids=token_type_ids).numpy() + + batch_probs.append(batch_prob) + if(len(batch_prob)==1): + batch_probs=np.array(batch_probs) + else: + batch_probs = np.concatenate(batch_probs, axis=0) + + return batch_probs + + +if __name__ == "__main__": + paddle.set_device(args.device) + + # If you want to use ernie1.0 model, plesace uncomment the following code + # tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained('ernie-1.0') + # pretrained_model = ppnlp.transformers.ErnieModel.from_pretrained("ernie-1.0") + + pretrained_model = ppnlp.transformers.ErnieGramModel.from_pretrained( + 'ernie-gram-zh') + tokenizer = ppnlp.transformers.ErnieGramTokenizer.from_pretrained( + 'ernie-gram-zh') + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + phase="predict") + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input_ids + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment_ids + ): [data for data in fn(samples)] + + valid_ds = load_dataset( + read_text_pair, data_path=args.input_file, lazy=False) + + valid_data_loader = create_dataloader( + valid_ds, + mode='predict', + batch_size=args.batch_size, + batchify_fn=batchify_fn, + trans_fn=trans_func) + + model = PairwiseMatching(pretrained_model) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError( + "Please set --params_path with correct pretrained model file") + + y_probs = predict(model, 
valid_data_loader) + + valid_ds = load_dataset( + read_text_pair, data_path=args.input_file, lazy=False) + + for idx, prob in enumerate(y_probs): + text_pair = valid_ds[idx] + text_pair["pred_prob"] = prob[0] + print(text_pair) \ No newline at end of file diff --git a/application/neural_search/ranking/ernie_matching/scripts/evaluate.sh b/application/neural_search/ranking/ernie_matching/scripts/evaluate.sh new file mode 100644 index 000000000000..bfb8c120a4cf --- /dev/null +++ b/application/neural_search/ranking/ernie_matching/scripts/evaluate.sh @@ -0,0 +1,16 @@ +unset CUDA_VISIBLE_DEVICES +# gpu +python -u -m paddle.distributed.launch --gpus "0" evaluate.py \ + --device gpu \ + --batch_size 32 \ + --learning_rate 2E-5 \ + --init_from_ckpt "./checkpoints/model_30000/model_state.pdparams" \ + --test_file sort/dev_pairwise.csv + +# cpu +# python evaluate.py \ +# --device cpu \ +# --batch_size 32 \ +# --learning_rate 2E-5 \ +# --init_from_ckpt "./checkpoints/model_30000/model_state.pdparams" \ +# --test_file sort/dev_pairwise.csv \ No newline at end of file diff --git a/application/neural_search/ranking/ernie_matching/scripts/export_model.sh b/application/neural_search/ranking/ernie_matching/scripts/export_model.sh new file mode 100644 index 000000000000..402b82c31ac6 --- /dev/null +++ b/application/neural_search/ranking/ernie_matching/scripts/export_model.sh @@ -0,0 +1 @@ +python export_model.py --params_path checkpoints/model_30000/model_state.pdparams --output_path=./output \ No newline at end of file diff --git a/application/neural_search/ranking/ernie_matching/scripts/predict_pairwise.sh b/application/neural_search/ranking/ernie_matching/scripts/predict_pairwise.sh new file mode 100644 index 000000000000..fe0767e14bfa --- /dev/null +++ b/application/neural_search/ranking/ernie_matching/scripts/predict_pairwise.sh @@ -0,0 +1,15 @@ +# gpu +python -u -m paddle.distributed.launch --gpus "0" \ + predict_pairwise.py \ + --device gpu \ + --params_path "./checkpoints/model_30000/model_state.pdparams"\ + --batch_size 128 \ + --max_seq_length 64 \ + --input_file 'sort/test_pairwise.csv' +# cpu +# python predict_pairwise.py \ +# --device gpu \ +# --params_path "./checkpoints/model_30000/model_state.pdparams"\ +# --batch_size 128 \ +# --max_seq_length 64 \ +# --input_file 'sort/test_pairwise.csv' \ No newline at end of file diff --git a/application/neural_search/ranking/ernie_matching/scripts/train_pairwise.sh b/application/neural_search/ranking/ernie_matching/scripts/train_pairwise.sh new file mode 100644 index 000000000000..ebf63ba3e35d --- /dev/null +++ b/application/neural_search/ranking/ernie_matching/scripts/train_pairwise.sh @@ -0,0 +1,21 @@ +# gpu +python -u -m paddle.distributed.launch --gpus "0,2,3,4" train_pairwise.py \ + --device gpu \ + --save_dir ./checkpoints \ + --batch_size 32 \ + --learning_rate 2E-5 \ + --margin 0.1 \ + --eval_step 100 \ + --train_file sort/train_pairwise.csv \ + --test_file sort/test_pairwise.csv + +# cpu +# python train_pairwise.py \ +# --device cpu \ +# --save_dir ./checkpoints \ +# --batch_size 32 \ +# --learning_rate 2E-5 \ +# --margin 0.1 \ +# --eval_step 100 \ +# --train_file sort/train_pairwise.csv \ +# --test_file sort/test_pairwise.csv \ No newline at end of file diff --git a/application/neural_search/ranking/ernie_matching/train_pairwise.py b/application/neural_search/ranking/ernie_matching/train_pairwise.py new file mode 100644 index 000000000000..f0cbe89a27b0 --- /dev/null +++ b/application/neural_search/ranking/ernie_matching/train_pairwise.py @@ 
-0,0 +1,238 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from functools import partial +import argparse +import os +import random +import time + +import numpy as np +import paddle +import paddle.nn.functional as F + +import paddlenlp as ppnlp +from paddlenlp.data import Stack, Tuple, Pad +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import LinearDecayWithWarmup + +from data import create_dataloader, gen_pair +from data import convert_pairwise_example as convert_example +from model import PairwiseMatching +import pandas as pd +from tqdm import tqdm + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--margin", default=0.2, type=float, help="Margin for pos_score and neg_score.") +parser.add_argument("--train_file", type=str, required=True, help="The full path of train file") +parser.add_argument("--test_file", type=str, required=True, help="The full path of test file") + +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=3, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--eval_step", default=200, type=int, help="Step interval for evaluation.") +parser.add_argument('--save_step', default=10000, type=int, help="Step interval for saving checkpoint.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proption over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +@paddle.no_grad() +def evaluate(model, metric, data_loader, phase="dev"): + """ + Given a dataset, it evals model and computes the metric. + + Args: + model(obj:`paddle.nn.Layer`): A model to classify texts. + data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. + metric(obj:`paddle.metric.Metric`): The evaluation metric. 
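+        phase(obj:`str`): The split name, used only as the prefix of the printed AUC log, e.g. "dev".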
+ """ + model.eval() + metric.reset() + + for idx, batch in enumerate(data_loader): + input_ids, token_type_ids, labels = batch + + pos_probs = model.predict( + input_ids=input_ids, token_type_ids=token_type_ids) + + neg_probs = 1.0 - pos_probs + + preds = np.concatenate((neg_probs, pos_probs), axis=1) + metric.update(preds=preds, labels=labels) + + print("eval_{} auc:{:.3}".format(phase, metric.accumulate())) + metric.reset() + model.train() + +# 构建读取函数,读取原始数据 +def read(src_path, is_predict=False): + data=pd.read_csv(src_path,sep='\t') + for index, row in tqdm(data.iterrows()): + query=row['query'] + title=row['title'] + neg_title=row['neg_title'] + yield {'query':query, 'title':title,'neg_title':neg_title} + +def read_test(src_path, is_predict=False): + data=pd.read_csv(src_path,sep='\t') + for index, row in tqdm(data.iterrows()): + query=row['query'] + title=row['title'] + label=row['label'] + yield {'query':query, 'title':title,'label':label} + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + # train_ds, dev_ds = load_dataset("lcqmc", splits=["train", "dev"]) + + train_ds=load_dataset(read,src_path=args.train_file,lazy=False) + dev_ds=load_dataset(read_test,src_path=args.test_file,lazy=False) + print(train_ds[0]) + + # train_ds = gen_pair(train_ds) + + # If you want to use ernie1.0 model, plesace uncomment the following code + # pretrained_model = ppnlp.transformers.ErnieModel.from_pretrained('ernie-1.0') + # tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained('ernie-1.0') + + pretrained_model = ppnlp.transformers.ErnieGramModel.from_pretrained( + 'ernie-gram-zh') + tokenizer = ppnlp.transformers.ErnieGramTokenizer.from_pretrained( + 'ernie-gram-zh') + + trans_func_train = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length) + + trans_func_eval = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + phase="eval") + + batchify_fn_train = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # pos_pair_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # pos_pair_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # neg_pair_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id) # neg_pair_segment + ): [data for data in fn(samples)] + + batchify_fn_eval = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # pair_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # pair_segment + Stack(dtype="int64") # label + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, + mode='train', + batch_size=args.batch_size, + batchify_fn=batchify_fn_train, + trans_fn=trans_func_train) + + dev_data_loader = create_dataloader( + dev_ds, + mode='dev', + batch_size=args.batch_size, + batchify_fn=batchify_fn_eval, + trans_fn=trans_func_eval) + + model = PairwiseMatching(pretrained_model, margin=args.margin) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + + num_training_steps = len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, + args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
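+    # AdamW calls apply_decay_param_fun(param.name) and applies weight decay only when it returns True,
+    # so parameters whose names contain "bias" or "norm" are excluded from decay.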
+ decay_params = [ + p.name for n, p in model.named_parameters() + if not any(nd in n for nd in ["bias", "norm"]) + ] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params) + + metric = paddle.metric.Auc() + + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + pos_input_ids, pos_token_type_ids, neg_input_ids, neg_token_type_ids = batch + + loss = model( + pos_input_ids=pos_input_ids, + neg_input_ids=neg_input_ids, + pos_token_type_ids=pos_token_type_ids, + neg_token_type_ids=neg_token_type_ids) + + global_step += 1 + if global_step % 10 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, + 10 / (time.time() - tic_train))) + tic_train = time.time() + + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + + if global_step % args.eval_step == 0 and rank == 0: + evaluate(model, metric, dev_data_loader, "dev") + + if global_step % args.save_step == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, 'model_state.pdparams') + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + +if __name__ == "__main__": + do_train() diff --git a/application/neural_search/recall/domain_adaptive_pretraining/README.md b/application/neural_search/recall/domain_adaptive_pretraining/README.md new file mode 100644 index 000000000000..997addc96989 --- /dev/null +++ b/application/neural_search/recall/domain_adaptive_pretraining/README.md @@ -0,0 +1,187 @@ + + **目录** + +* [背景介绍](#背景介绍) +* [ERNIE 1.0](#ERNIE1.0) + * [1. 技术方案和评估指标](#技术方案) + * [2. 环境依赖](#环境依赖) + * [3. 代码结构](#代码结构) + * [4. 数据准备](#数据准备) + * [5. 模型训练](#模型训练) + * [6. 模型转换](#模型转换) + + + +# 背景介绍 + + +ERNIE是百度开创性提出的基于知识增强的持续学习语义理解框架,它将大数据预训练与多源丰富知识相结合,通过持续学习技术,不断吸收海量文本数据中词汇、结构、语义等方面的知识,实现模型效果不断进化。 + +ERNIE在情感分析、文本匹配、自然语言推理、词法分析、阅读理解、智能问答等16个公开数据集上全面显著超越世界领先技术,在国际权威的通用语言理解评估基准GLUE上,得分首次突破90分,获得全球第一。 +相关创新成果也被国际顶级学术会议AAAI、IJCAI收录。 +同时,ERNIE在工业界得到了大规模应用,如搜索引擎、新闻推荐、广告系统、语音交互、智能客服等。 + +本示例采用了全新数据流程,适配了ERNIE预训练任务,具有高效易用,方便快捷的特点。支持动态文本mask,自动断点训练重启等。 +用户可以根据自己的需求,灵活修改mask方式。具体可以参考`./data_tools/dataset_utils.py`中`create_masked_lm_predictions`函数。 +用户可以设置`checkpoint_steps`,间隔`checkpoint_steps`数,即保留最新的checkpoint到`model_last`文件夹。重启训练时,程序默认从最新checkpoint重启训练,学习率、数据集都可以恢复到checkpoint时候的状态。 + + + + +# ERNIE 1.0 + + + + +## 1. 技术方案和评估指标 + +### 技术方案 +采用ERNIE1.0预训练垂直领域的模型 + + + + +## 2. 环境依赖和安装说明 + +**环境依赖** +* python >= 3.6 +* paddlepaddle >= 2.1.3 +* paddlenlp >= 2.2 +* visualdl >=2.2.2 +* pybind11 + +安装命令 `pip install visualdl pybind11` + + + +## 3. 代码结构 + +以下是本项目主要代码结构及说明: + +``` +ERNIE 1.0/ +|—— scripts + |—— run_pretrain_static.sh # 静态图与训练bash脚本 +├── ernie_static_to_dynamic.py # 静态图转动态图 +├── run_pretrain_static.py # ernie1.0静态图预训练 +├── args.py # 预训练的参数配置文件 +└── data_tools # 预训练数据处理文件目录 +``` + + + +## 4. 数据准备 + +数据准备部分请移步[data_tools](./data_tools/)目录,根据文档,创建训练数据。 + +## 5. 模型训练 + +**领域适应模型下载链接:** + +|Model|训练参数配置|硬件|MD5| +| ------------ | ------------ | ------------ |-----------| +|[ERNIE 1.0](https://bj.bcebos.com/v1/paddlenlp/models/ernie_post.zip)|
max_lr:0.0001 min_lr:0.00001 bs:512 max_len:512 | 4卡 v100-32g
|-| + +### 训练环境说明 + + +- NVIDIA Driver Version: 440.64.00 +- Ubuntu 16.04.6 LTS (Docker) +- Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz + + +### 单机单卡训练/单机多卡训练 + +这里采用单机多卡方式进行训练,通过如下命令,指定 GPU 0,1,2,3 卡, 基于SimCSE训练模型,数据量比较小,几分钟就可以完成。如果采用单机单卡训练,只需要把--pugs参数设置成单卡的卡号即可 + + + +### 模型训练 + + +``` +python -u -m paddle.distributed.launch \ + --gpus "0,1,2,3" \ + --log_dir "output/$task_name/log" \ + run_pretrain_static.py \ + --model_type "ernie" \ + --model_name_or_path "ERNIE 1.0" \ + --input_dir "./data" \ + --output_dir "output/$task_name" \ + --max_seq_len 512 \ + --micro_batch_size 32 \ + --global_batch_size 128 \ + --sharding_degree 1\ + --dp_degree 4 \ + --use_sharding false \ + --use_amp true \ + --use_recompute false \ + --max_lr 0.0001 \ + --min_lr 0.00001 \ + --max_steps 200000 \ + --save_steps 100000 \ + --checkpoint_steps 5000 \ + --decay_steps 1980000 \ + --weight_decay 0.01\ + --warmup_rate 0.01 \ + --grad_clip 1.0 \ + --num_workers 2 \ + --logging_freq 20\ + --eval_freq 1000 \ + --device "gpu" +``` +也可以直接运行脚本: + +``` +sh scripts/run_pretrain_static.sh +``` + +其中参数释义如下: +- `model_name_or_path` 要训练的模型或者之前训练的checkpoint。 +- `input_dir` 指定输入文件,可以使用目录,指定目录时将包括目录中的所有文件。 +- `output_dir` 指定输出文件。 +- `max_seq_len` 输入文本序列的长度。 +- `micro_batch_size` 单卡单次的 batch size大小。即单张卡运行一次前向网络的 batch size大小。 +- `global_batch_size` 全局的batch size大小,即一次参数更新等效的batch size。 +- `sharding_degree` 切参数切分的分组大小(如 sharding_degree=4 表示参数分为4组,分别到4个设备)。 +- `dp_degree` 数据并行参数。 +- `use_sharding` 开启sharding策略,sharding_degree > 1时,请设置为True。 +- `use_amp` 开启混合精度策略。 +- `use_recompute` 开启重计算策略。暂时未支持,后续将支持。 +- `max_lr` 训练学习率。 +- `min_lr` 学习率衰减的最小值。 +- `max_steps` 最大训练步数。 +- `save_steps` 保存模型间隔。 +- `checkpoint_steps` 模型checkpoint间隔,用于模型断点重启训练。 +- `weight_decay` 权重衰减参数。 +- `warmup_rate` 学习率warmup参数。 +- `grad_clip` 梯度裁剪范围。 +- `logging_freq` 日志输出间隔。 +- `eval_freq` 模型评估间隔。 +- `device` 训练设备。 + +注: +- 一般而言,需要设置 `mp_degree * sharding_degree` = 训练机器的总卡数。 +- 一般而言, `global_batch_size = micro_batch_size * sharding_degree * dp_degree`。可以使用梯度累积的方式增大`global_batch_size`。设置`global_batch_size`为理论值的整数倍是,默认启用梯度累积。 +- 训练断点重启,直接启动即可,程序会找到最新的checkpoint,开始重启训练。 + + + +## 6. 模型转换 + +### 静态图转动态图 + +修改代码中的路径: + +``` +static_model_path="./output/ERNIE 1.0-dp8-gb1024/model_last/static_vars" +``` +然后运行 +``` +python ernie_static_to_dynamic.py +``` +运行结束后,动态图的模型就会保存到ernie_checkpoint文件夹里,也可以根据情况,修改代码,保存到自己的指定路径 + +### 参考文献 + +- [ERNIE: Enhanced Representation through Knowledge Integration](https://arxiv.org/pdf/1904.09223.pdf) diff --git a/application/neural_search/recall/domain_adaptive_pretraining/args.py b/application/neural_search/recall/domain_adaptive_pretraining/args.py new file mode 100644 index 000000000000..a28cf747c1fc --- /dev/null +++ b/application/neural_search/recall/domain_adaptive_pretraining/args.py @@ -0,0 +1,115 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
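+
+# args.py: command-line argument definitions shared by the ERNIE 1.0 static-graph
+# pretraining entry point (run_pretrain_static.py); parse_args() also logs the
+# paddle commit id and every parsed value.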
+ +import argparse + +import paddle +from paddlenlp.utils.log import logger + + +def str2bool(v): + if v.lower() in ('yes', 'true', 't', 'y', '1'): + return True + elif v.lower() in ('no', 'false', 'f', 'n', '0'): + return False + else: + raise argparse.ArgumentTypeError('Unsupported value encountered.') + + +def parse_args(MODEL_CLASSES): + parser = argparse.ArgumentParser() + # yapf: disable + parser.add_argument("--model_type", default=None, type=str, required=True, help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()), ) + parser.add_argument("--model_name_or_path", default=None, type=str, required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join( + sum([ list(classes[-1].pretrained_init_configuration.keys()) for classes in MODEL_CLASSES.values() ], [])),) + + # Train I/O config + parser.add_argument("--input_dir", default=None, type=str, required=True, help="The input directory where the data will be read from.", ) + parser.add_argument("--output_dir", default=None, type=str, required=True, help="The output directory where the training logs and checkpoints will be written.") + parser.add_argument("--split", type=str, default='949,50,1', help="Train/valid/test data split.") + + parser.add_argument("--max_seq_len", type=int, default=1024, help="Max sequence length.") + parser.add_argument("--micro_batch_size", default=8, type=int, help="Batch size per device for one step training.", ) + parser.add_argument("--global_batch_size", default=None, type=int, help="Global batch size for all training process. None for not check the size is valid. If we only use data parallelism, it should be device_num * micro_batch_size.") + + # Default training config + parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") + parser.add_argument("--grad_clip", default=0.0, type=float, help="Grad clip for the parameter.") + parser.add_argument("--max_lr", default=1e-5, type=float, help="The initial max learning rate for Adam.") + parser.add_argument("--min_lr", default=5e-5, type=float, help="The initial min learning rate for Adam.") + parser.add_argument("--warmup_rate", default=0.01, type=float, help="Linear warmup over warmup_steps for learing rate.") + + # Adam optimizer config + parser.add_argument("--adam_beta1", default=0.9, type=float, help="The beta1 for Adam optimizer. The exponential decay rate for the 1st moment estimates.") + parser.add_argument("--adam_beta2", default=0.999, type=float, help="The bate2 for Adam optimizer. The exponential decay rate for the 2nd moment estimates.") + parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") + + # Training steps config + parser.add_argument("--num_train_epochs", default=1, type=int, help="Total number of training epochs to perform.", ) + parser.add_argument("--max_steps", default=500000, type=int, help="If > 0: set total number of training steps to perform. Override num_train_epochs.") + parser.add_argument("--checkpoint_steps", type=int, default=500, help="Save checkpoint every X updates steps to the model_last folder.") + parser.add_argument("--save_steps", type=int, default=500, help="Save checkpoint every X updates steps.") + parser.add_argument("--decay_steps", default=360000, type=int, help="The steps use to control the learing rate. 
If the step > decay_steps, will use the min_lr.") + parser.add_argument("--logging_freq", type=int, default=1, help="Log every X updates steps.") + parser.add_argument("--eval_freq", type=int, default=500, help="Evaluate for every X updates steps.") + parser.add_argument("--eval_iters", type=int, default=10, help="Evaluate the model use X steps data.") + + # Config for 4D Parallelism + parser.add_argument("--use_sharding", type=str2bool, nargs='?', const=False, help="Use sharding Parallelism to training.") + parser.add_argument("--sharding_degree", type=int, default=1, help="Sharding degree. Share the parameters to many cards.") + parser.add_argument("--dp_degree", type=int, default=1, help="Data Parallelism degree.") + parser.add_argument("--mp_degree", type=int, default=1, help="Model Parallelism degree. Spliting the linear layers to many cards.") + parser.add_argument("--pp_degree", type=int, default=1, help="Pipeline Parallelism degree. Spliting the the model layers to different parts.") + parser.add_argument("--use_recompute", type=str2bool, nargs='?', const=False, help="Using the recompute to save the memory.") + + # AMP config + parser.add_argument("--use_amp", type=str2bool, nargs='?', const=False, help="Enable mixed precision training.") + parser.add_argument("--enable_addto", type=str2bool, nargs='?', const=True, default=True, help="Whether to enable the addto strategy for gradient accumulation or not. This is only used for AMP training.") + parser.add_argument("--scale_loss", type=float, default=128, help="The value of scale_loss for fp16. This is only used for AMP training.") + parser.add_argument("--hidden_dropout_prob", type=float, default=0.1, help="The hidden dropout prob.") + parser.add_argument("--attention_probs_dropout_prob", type=float, default=0.1, help="The attention probs dropout prob.") + + # Other config + parser.add_argument("--seed", type=int, default=1234, help="Random seed for initialization.") + parser.add_argument("--num_workers", type=int, default=2, help="Num of workers for DataLoader.") + parser.add_argument("--check_accuracy", type=str2bool, nargs='?', const=False, help="Check accuracy for training process.") + parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu", "xpu"], help="select cpu, gpu, xpu devices.") + parser.add_argument("--lr_decay_style", type=str, default="cosine", choices=["cosine", "none"], help="Learning rate decay style.") + parser.add_argument("--share_folder", type=str2bool, nargs='?', const=False, help="Use share folder for data dir and output dir on multi machine.") + + # Argument for bert + parser.add_argument("--masked_lm_prob", type=float, default=0.15, help="Mask token prob.") + parser.add_argument("--short_seq_prob", type=float, default=0.1, help="Short sequence prob.") + # yapf: enable + + args = parser.parse_args() + args.test_iters = args.eval_iters * 10 + + if args.check_accuracy: + if args.hidden_dropout_prob != 0: + args.hidden_dropout_prob = .0 + logger.warning( + "The hidden_dropout_prob should set to 0 for accuracy checking.") + if args.attention_probs_dropout_prob != 0: + args.attention_probs_dropout_prob = .0 + logger.warning( + "The attention_probs_dropout_prob should set to 0 for accuracy checking." 
+ ) + + logger.info('{:20}:{}'.format("paddle commit id", paddle.version.commit)) + for arg in vars(args): + logger.info('{:20}:{}'.format(arg, getattr(args, arg))) + + return args diff --git a/application/neural_search/recall/domain_adaptive_pretraining/data_tools/Makefile b/application/neural_search/recall/domain_adaptive_pretraining/data_tools/Makefile new file mode 100644 index 000000000000..8f9db7686696 --- /dev/null +++ b/application/neural_search/recall/domain_adaptive_pretraining/data_tools/Makefile @@ -0,0 +1,9 @@ +CXXFLAGS += -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color +CPPFLAGS += $(shell python3 -m pybind11 --includes) +LIBNAME = helpers +LIBEXT = $(shell python3-config --extension-suffix) + +default: $(LIBNAME)$(LIBEXT) + +%$(LIBEXT): %.cpp + $(CXX) $(CXXFLAGS) $(CPPFLAGS) $< -o $@ diff --git a/application/neural_search/recall/domain_adaptive_pretraining/data_tools/README.md b/application/neural_search/recall/domain_adaptive_pretraining/data_tools/README.md new file mode 100644 index 000000000000..b3897f443de6 --- /dev/null +++ b/application/neural_search/recall/domain_adaptive_pretraining/data_tools/README.md @@ -0,0 +1,192 @@ +# PaddleNLP 预训练数据流程 + +本示例致力于打造基于PaddleNLP预训练模型的最佳实践。 + +我们将预训练数据过程划分为以下部分 + +- 原始数据转换,原始文本转换为jsonl的json字符串格式。 +- 数据ID化,断句、分词、tokenize转化为token id格式。 +- 训练index文件生成,生成train、valid、test的每个样本索引。 +- token动态mask(可选),python 层实时mask文本。 + +本目录下主要包含一下文件: +``` +../data_tools +├── create_pretraining_data.py +├── dataset_utils.py +├── ernie_dataset.py +├── helpers.cpp +├── Makefile +├── README.md +└── trans_to_json.py +``` +其中,`trans_to_json.py`是原始数据转化的脚本,将数据转化为json串格式。 +`create_pretraining_data.py`将jsonl文本,断句、分词后,tokenizer转化为token id。 +`dataset_utils.py`中包含了index生成、动态mask的实现。 +`ernie_dataset.py`通过调用`dataset_utils.py`的一些函数,产生ernie的输入dataset。 + +### 环境依赖 + + - tqdm + - numpy + - pybind11 + - lac (可选) + - zstandard (可选) + +安装命令`pip install tqdm numpy pybind11 lac zstandard`。另,部分功能需要`g++>=4.8`编译支持 + + +## 训练全流程数据Pipeline + +|步骤|阶段|数据格式| 样例| +|-|-|-|-| +| - |-|原始数据:
每个doc之间用空行间隔开
- 中文:默认每句一行,以换行符作为句子结束。
- 英文,默认使用nltk判断句子结束 | ```百度,是一家中国互联网公司。```
```百度为用户提供搜索服务。```

```PaddleNLP是自然语言处理领域的优秀工具。``` | +|原始数据转换
`trans_to_json.py`|预处理|jsonl格式:每个doc对应一行json字符串| ```{"text": "百度是一家中国互联网公司。百度为..."}```
```{"text": "PaddleNLP是自然语言..."}``` +|数据ID化
`create_pretrain_data.py`|预处理| npy格式:数据id化后的token id
npz格式:数据句子、文章位置索引 | - +|训练index文件生成|训练启动|npy格式:
根据训练步数max_steps生成
train、valid、test的每个样本索引文件| - +|token动态mask(可选)| Dataset取数据 | 无 |- + + +## ERNIE预训练例子 + +下面以ERNIE预训练为例,简要介绍一下预训练的全流程。 + +### 原始数据 +首先下载样例数据: +``` +mkdir data && cd data +wget https://paddlenlp.bj.bcebos.com/models/transformers/data_tools/baike.txt +cd .. +``` +### 原始数据转换 jsonl 格式 +使用`trans_to_json.py`转化为json串格式,下面是脚本的使用说明 +``` +optional arguments: + -h, --help show this help message and exit + --input_path INPUT_PATH + Path to you raw files. Folder or file path. + 必须设置,可以是文件夹或者单个文件。文件夹中的目录默认最多搜索两层子目录。 + --output_path OUTPUT_PATH + Path to save the output json files. + 必须设置,输出文件的名字。 + --json_key JSON_KEY The content key of json file. + 建议不修改,默认的key是text + --doc_spliter DOC_SPLITER + Spliter between documents. We will strip the line, if you use blank line to split doc, leave it blank. + 根据实际情况修改,默认空行作为文章换行符。 + --min_doc_length MIN_DOC_LENGTH + Minimal char of a documment. + 可选。过滤掉长度多短的文章,默认值10 + --workers WORKERS Number of worker processes to launch + 可选。多进程转化文件,适用于 input_path 中包含的文件数据较多的情况。每个文件,分配给不同worker处理 + --log_interval LOG_INTERVAL + Interval between progress updates. + 可选。此处的interval是值处理完文件个数的间隔。 + --no-merge Don't merge the file. + 可选。默认不开启这个选项,默认每个文件转换的jsonl文本,会拼接成到同一个文件。 + --no-shuffle Don't shuffle the file. + 可选。默认不开启这个选项,默认对处理完进行shuffle。 +``` +根据说明,我们使用下面简单命令,可以得到`baike_sample.jsonl`文件。此处,我们对文章所有doc进行了shuffle。 +```shell +python trans_to_json.py --input_path ./data --output_path baike_sample + +#查看数据 +head -1 baike_sample.jsonl +{"text": "百度手机助手:最具人气的应用商店\n百度手机助手是Android手机的权威资源平台,分发市场份额连续十个季度排名市场第一,拥有最全最好的应用、游戏、壁纸资源,帮助用户在海量资源中精准搜索、高速下载、轻松管理,万千汇聚,一触即得。\n"} +``` + +### 数据ID化 +本部分,我们使用 `create_pretraining_data.py` 脚本将前面得到的 `baike_sample.jsonl` 进行tokenize id化处理。 +``` +optional arguments: + -h, --help show this help message and exit + --model_name MODEL_NAME + What model to use. + 必须设置,如:ernie-1.0, 可以参考已有的模型名称 https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/model_zoo/transformers.rst + --tokenizer_name {ErnieTokenizer,BertTokenizer,GPTTokenizer,GPTChineseTokenizer} + What type of tokenizer to use. + 模型对应的tokenizer, 目前暂时只支持 Ernie,Bert,GPT +data input/output: + --input_path INPUT_PATH + Path to input JSON files. + 必须设置,输入文件jsonl的目录 + --output_prefix OUTPUT_PREFIX + Output prefix to store output file. + 必须设置,输出文件的名称。 + 假设名称为XXX,则会输出 XXX_ids.npy, XXX_idx.npz 两个文件。 + npy文件,数据id化后的token ids; npz文件,数据句子、文章位置索引。 + --data_format {JSON} Only support json format for now. One document per line. + 不需要设置。目前默认处理jsonl数据格式 + --json_key JSON_KEY For JSON format. Space separate listed of keys to extract from json + 文本串json的key值。同前面trans_to_json.py的json_key,默认text为key + --split_sentences Split documents into sentences. + 是否需要将文章划分成句子。一般而言,GPT不需要,Bert/Ernie模型需要 + +chinese words: + --chinese Is corpus need words segmentation step for chinese words. + 中文情形必须设置。处理的文本类型是否是中文。 + --cn_whole_word_segment + Is corpus need words segmentation step for chinese words WWM. + 可选。是否需要WWM策略。一般而言,Bert/Ernie模型需要,GPT不需要。 + --cn_seg_func {lac,seg,jieba} + Words segment function for chinese words. + 默认lac,jieba速度较快 + --cn_splited Is chinese corpus is splited in to words. + 分词后的文本,可选。设置此选项则,cn_seg_func不起作用。 + 例如分词后文本串 "百度 手机助手 是 Android 手机 的 权威 资源平台" + --cn_split_dimer CN_SPLIT_DIMER + Split dimer between chinese words. + 配合cn_splited使用,默认空格表示分词间隔。 + +common config: + --append_eos Append an token to the end of a document. 
+ gpt模型专用,gpt设置此选项,表示doc结束。 + --log_interval LOG_INTERVAL + Interval between progress updates + 打印日志间隔,interval表示处理 文本行数/doc数的 间隔。 + --workers WORKERS Number of worker processes to launch + 处理文本id化的进程个数。 +``` +同过下面脚本转化,我们可以得到处理好的预训练数据,token ids:`baike_sample_ids.npy`, 文章索引信息`baike_sample_idx.npz`. +``` +python -u create_pretraining_data.py \ + --model_name ernie-1.0 \ + --tokenizer_name ErnieTokenizer \ + --input_path baike_sample.jsonl \ + --split_sentences\ + --chinese \ + --cn_whole_word_segment \ + --output_prefix baike_sample \ + --workers 1 \ + --log_interval 5 +``` + +### Ernie预训练开始 +得到了处理好的训练数据,就可以开始Ernie模型的预训练了。ernie预训练的代码在`examples/language_model/ernie-1.0`。 +简单将预处理好的数据,拷贝到data目录,即可开始Ernie模型预训练。 +``` +cd .. +mkdir data +mv ./data_tools/baike_sample* ./data 或者运行 sh mv_data.sh +sh run_static.sh +# 建议修改 run_static.sh 中的配置,将max_steps设置小一些。 +``` +代码说明: + +- ernie预训练使用的 dataset 代码文件在 `./data_tools/ernie_dataset.py` +- 数据集index生成,动态mask相关代码实现在`./data_tools/dataset_utils.py` + +用户可以根据自己的需求,灵活修改mask方式。具体可以参考`dataset_utils.py`中`create_masked_lm_predictions`函数。 +可以自定义的选项有do_whole_word_mask, favor_longer_ngram, do_permutation, geometric_dist等, +可以参考[Megatron](https://github.com/NVIDIA/Megatron-LM)使用这些lm_mask策略。 + +### FAQ + +#### C++代码编译失败怎么办? +- 请先检查pybind11包是否安装,g++、make工具是否正常。 +- 编译失败可能是本文件夹下的Makefile命令出现了一些问题。可以将Makefile中的python3、python3-config设置成完全的路径,如/usr/bin/python3.7。 + +## 参考内容 + +注: 大部分数据流程,参考自[Megatron](https://github.com/NVIDIA/Megatron-LM),特此表达感谢。 diff --git a/application/neural_search/recall/domain_adaptive_pretraining/data_tools/create_data.sh b/application/neural_search/recall/domain_adaptive_pretraining/data_tools/create_data.sh new file mode 100644 index 000000000000..fb7a372f1453 --- /dev/null +++ b/application/neural_search/recall/domain_adaptive_pretraining/data_tools/create_data.sh @@ -0,0 +1,14 @@ +# wanfangdata/ +# python trans_to_json.py --input_path ./data --output_path baike_sample +python trans_to_json.py --input_path ./wanfangdata --output_path baike_sample + +python -u create_pretraining_data.py \ + --model_name ernie-1.0 \ + --tokenizer_name ErnieTokenizer \ + --input_path baike_sample.jsonl \ + --split_sentences\ + --chinese \ + --cn_whole_word_segment \ + --output_prefix baike_sample \ + --workers 1 \ + --log_interval 5 \ No newline at end of file diff --git a/application/neural_search/recall/domain_adaptive_pretraining/data_tools/create_pretraining_data.py b/application/neural_search/recall/domain_adaptive_pretraining/data_tools/create_pretraining_data.py new file mode 100644 index 000000000000..fe1efecb0ab5 --- /dev/null +++ b/application/neural_search/recall/domain_adaptive_pretraining/data_tools/create_pretraining_data.py @@ -0,0 +1,398 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
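+
+# create_pretraining_data.py: reads the jsonl (or .zst) corpus produced by
+# trans_to_json.py, tokenizes it (optionally with Chinese word segmentation and
+# whole-word-mask marks), and writes <output_prefix>_ids.npy (flat token ids) plus
+# <output_prefix>_idx.npz (sentence lengths and sentence/document cumulative offsets).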
+ +import os +import io +import re +import argparse +import json +import multiprocessing +import sys +import time + +import numpy as np +from tqdm import tqdm + +import paddlenlp.transformers as tfs + +try: + import nltk + nltk_available = True +except ImportError: + nltk_available = False + + +def get_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + '--model_name', type=str, required=True, help='What model to use.') + parser.add_argument( + '--tokenizer_name', + type=str, + required=True, + choices=[ + 'ErnieTokenizer', 'BertTokenizer', 'GPTTokenizer', + 'GPTChineseTokenizer' + ], + help='What type of tokenizer to use.') + group = parser.add_argument_group(title='data input/output') + group.add_argument( + '--input_path', + type=str, + required=True, + help='Path to input JSON files.') + group.add_argument( + '--output_prefix', + type=str, + required=True, + help='Output prefix to store output file.') + group.add_argument( + '--data_format', + type=str, + default='text', + choices=['JSON'], + help='Only support json format for now. One document per line.') + group.add_argument( + '--json_key', + type=str, + default='text', + help='For JSON format. Space separate listed of keys to extract from json' + ) + group.add_argument( + '--split_sentences', + action='store_true', + help='Split documents into sentences.') + + group = parser.add_argument_group(title='chinese words') + group.add_argument( + '--chinese', + action='store_true', + help="Is corpus need words segmentation step for chinese words.") + group.add_argument( + '--cn_whole_word_segment', + action='store_true', + help="Is corpus need words segmentation step for chinese words WWM.") + group.add_argument( + '--cn_seg_func', + type=str, + default='lac', + choices=['lac', 'seg', 'jieba'], + help='Words segment function for chinese words.') + group.add_argument( + '--cn_splited', + action='store_true', + help="Is chinese corpus is splited in to words.") + group.add_argument( + '--cn_split_dimer', + type=str, + default=' ', + help="Split dimer between chinese words.") + + group = parser.add_argument_group(title='common config') + group.add_argument( + '--append_eos', + action='store_true', + help='Append an token to the end of a document.') + group.add_argument( + '--log_interval', + type=int, + default=100, + help='Interval between progress updates') + group.add_argument( + '--workers', + type=int, + default=1, + help='Number of worker processes to launch') + + args = parser.parse_args() + return args + + +def lexical_analysis_fn(): + from LAC import LAC + lac = LAC(mode="lac") + + def process(line): + words, _ = lac.run(line) + return words + + return process + + +def chinese_segmentation_fn(): + from LAC import LAC + lac_cws = LAC(mode='seg') + + def process(line): + words = lac.run(line) + return words + + return process + + +def jieba_segmentation_fn(): + import jieba + + def process(line): + words = jieba.cut(line) + return list(words) + + return process + + +CHINESE_SEG_FUNC = { + 'lac': lexical_analysis_fn(), + 'seg': chinese_segmentation_fn(), + 'jieba': jieba_segmentation_fn(), +} + + +def get_whole_word_mask_tokens(tokens, words, max_word_length=4): + """ + Do whole word mask on Chinese word. + First, we do Chinese word segmentation on the sequence of tokens, which are from the WordPiece tokenization. + Then, we add the '##' mark on chinese characters which are in the middle of Chinese words. + And if the tokens are not chinese characters, we just exploit the results of WordPiece tokenization as words. 
+ Such as, + - text line : 通过利用mercer核,将样本从输入空间映射到高维特征空间,使原来没有显现的特征突现出来,取得了很好的图像分割效果。 + - the input tokens (after WordPiece): + ['通', '过', '利', '用', 'me', '##rc', '##er', '核', ',', '将', '样', '本', '从', '输', '入', '空', '间', '映', + '射', '到', '高', '维', '特', '征', '空', '间', ',', '使', '原', '来', '没', '有', '显', '现', '的', '特', '征', + '突', '现', '出', '来', ',', '取', '得', '了', '很', '好', '的', '图', '像', '分', '割', '效', '果', '。'] + - the Chinese words (after Chinese word segmentation like jieba) + ['通过', '利用', 'mercer', '核', ',', '将', '样本', '从', '输入', '空间', '映射', '到', '高维', '特征', + '空间', ',', '使', '原来', '没有', '显现', '的', '特征', '突现', '出来', ',', '取得', '了', '很', '好', + '的', '图像', '分割', '效果', '。'] + - the output whole word mask tokens: + ['通', '##过', '利', '##用', 'me', '##rc', '##er', '核', ',', '将', '样', '##本', '从', '输', '##入', + '空', '##间', '映', '##射', '到', '高', '##维', '特', '##征', '空', '##间', ',', '使', '原', '##来', + '没', '##有', '显', '##现', '的', '特', '##征', '突', '##现', '出', '##来', ',', '取', '##得', '了', + '很', '好', '的', '图', '##像', '分', '##割', '效', '##果', '。'] + + Args: + tokens(list(str)): The sequence of tokens, which are from the WordPiece tokenization. + words(list(str)): The sequence of Chinese words. + max_word_length(int, optional): + The maximum chinese character in Chinese words. It avoids too long Chinese word to be masked. + Defaults as 4. + + Returns: + new_tokens(list(str)): The new token will be done with whole word masking strategy. + + """ + + new_tokens = [] + # opt for long document + words_set = set(words) + i = 0 + while i < len(tokens): + # non-chinese character, then do word piece + if len(re.findall('[\u4E00-\u9FA5]', tokens[i])) == 0: + new_tokens.append(tokens[i]) + i += 1 + continue + + # add "##" mark on the middel tokens of Chinese words + # such as ["通过", "利用"] -> ["通", "##过", "利", "##用"] + has_add = False + for length in range(max_word_length, 0, -1): + if i + length > len(tokens): + continue + if ''.join(tokens[i:i + length]) in words_set: + new_tokens.append(tokens[i]) + for l in range(1, length): + new_tokens.append('##' + tokens[i + l]) + i += length + has_add = True + break + + if not has_add: + new_tokens.append(tokens[i]) + i += 1 + return new_tokens + + +class IdentitySplitter(object): + def tokenize(self, *text): + return text + + +class NewlineSplitter(): + def tokenize(self, text): + return text.split("\n") + + +class Converter(object): + def __init__(self, args): + self.args = args + + def initializer(self): + Converter.tokenizer = getattr( + tfs, self.args.tokenizer_name).from_pretrained(self.args.model_name) + + # Split document to sentence. 
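+        # Chinese corpora are assumed to be one sentence per line, so a plain newline
+        # splitter is used; other languages fall back to the NLTK punkt model, and
+        # without --split_sentences the whole document is kept as a single segment.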
+ if self.args.split_sentences: + if self.args.chinese: + Converter.splitter = NewlineSplitter() + else: + if not nltk_available: + print("NLTK is not available to split sentences.") + exit() + splitter = nltk.load("tokenizers/punkt/english.pickle") + Converter.splitter = splitter + else: + Converter.splitter = IdentitySplitter() + + # Split sentence whole words mask for chinese + if self.args.cn_whole_word_segment: + if self.args.cn_splited: + Converter.segment_func = lambda text: text.split(self.args.cn_split_dimer) + else: + Converter.segment_func = CHINESE_SEG_FUNC[self.args.cn_seg_func] + Converter.whole_word_mask = get_whole_word_mask_tokens + else: + Converter.segment_func = lambda x: x + Converter.whole_word_mask = lambda x, y: x + + def process(text): + words = Converter.segment_func(text) + tokens = Converter.tokenizer.tokenize("".join(words)) + tokens = Converter.whole_word_mask(tokens, words) + tokens = Converter.tokenizer.convert_tokens_to_ids(tokens) + return tokens + + Converter.process = process + + def encode(self, json_line): + text = json.loads(json_line)[self.args.json_key] + doc_ids = [] + for sentence in Converter.splitter.tokenize(text): + sentence_ids = Converter.process(sentence.strip()) + if len(sentence_ids) > 0: + doc_ids.append(sentence_ids) + + if len(doc_ids) > 0 and self.args.append_eos: + doc_ids[-1].append(Converter.tokenizer.eos_token_id) + + return doc_ids, len(text.encode("utf-8")) + + +def main(): + args = get_args() + + file_paths = [] + if os.path.isfile(args.input_path): + file_paths.append(args.input_path) + else: + for root, _, fs in os.walk(args.input_path): + for f in fs: + file_paths.append(os.path.join(root, f)) + convert = Converter(args) + + # Try tokenizer is availiable + sample_tokenizer = getattr( + tfs, args.tokenizer_name).from_pretrained(args.model_name) + if sample_tokenizer.vocab_size < 2**16 - 1: + save_dtype = np.uint16 + else: + save_dtype = np.int32 + + pool = multiprocessing.Pool(args.workers, initializer=convert.initializer) + + # We use BytesIO to store the ids. 
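+    # Four in-memory streams are filled while encoding:
+    #   token_ids_stream   - flat token ids of every sentence
+    #   sentlens_stream    - int32 length of each sentence
+    #   sent_cumsum_stream - int64 cumulative token count after each sentence
+    #   doc_cumsum_stream  - int64 cumulative sentence count after each document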
+ token_ids_stream = io.BytesIO() + sentlens_stream = io.BytesIO() + # Cumsum on tokens num + sent_cumsum_stream = io.BytesIO() + sent_cumsum_stream.write((0).to_bytes(8, byteorder='little', signed=True)) + # Cunsum on document on every sentence num, type=np.int64 + doc_cumsum_stream = io.BytesIO() + doc_cumsum_stream.write((0).to_bytes(8, byteorder='little', signed=True)) + + sent_count = 0 + token_count = 0 + + file_paths.sort() + + step = 0 + total_bytes_processed = 0 + startup_start = time.time() + for file_path in tqdm(file_paths): + if file_path.endswith(".zst"): + import zstandard + cctx = zstandard.ZstdDecompressor() + fh = open(file_path, 'rb') + text = io.BufferedReader(cctx.stream_reader(fh)) + elif file_path.endswith(".jsonl"): + text = open(file_path, 'r', encoding='utf-8') + else: + print("Unexpected data format, skiped %s" % file_path) + continue + + encoded_docs = pool.imap(convert.encode, text, 256) + print("Processing %s" % file_path) + for i, (doc, bytes_processed) in enumerate(encoded_docs, start=1): + step += 1 + total_bytes_processed += bytes_processed + if len(doc) == 0: + continue + + for sentence in doc: + sentence_len = len(sentence) + if sentence_len == 0: + continue + sentlens_stream.write( + sentence_len.to_bytes( + 4, byteorder='little', signed=True)) + token_count += sentence_len + sent_cumsum_stream.write( + token_count.to_bytes( + 8, byteorder='little', signed=True)) + sent_count += 1 + token_ids_stream.write( + np.array( + sentence, dtype=save_dtype).tobytes(order='C')) + + doc_cumsum_stream.write( + sent_count.to_bytes( + 8, byteorder='little', signed=True)) + + if step % args.log_interval == 0: + current = time.time() + elapsed = current - startup_start + mbs = total_bytes_processed / elapsed / 1024 / 1024 + print( + f"Processed {step} documents", + f"({step/elapsed:.2f} docs/s, {mbs:.4f} MB/s).", + file=sys.stderr) + + pool.close() + print("Saving tokens to files...") + all_doc_ids = np.frombuffer(token_ids_stream.getbuffer(), dtype=save_dtype) + lens = np.frombuffer(sentlens_stream.getbuffer(), dtype=np.int32) + sents = np.frombuffer(sent_cumsum_stream.getbuffer(), dtype=np.int64) + docs = np.frombuffer(doc_cumsum_stream.getbuffer(), dtype=np.int64) + np.save(args.output_prefix + "_ids.npy", all_doc_ids) + np.savez(args.output_prefix + "_idx.npz", lens=lens, sents=sents, docs=docs) + + print("Total sentences num: %d" % len(lens)) + print("Total documents num: %d" % (len(docs) - 1)) + print("Total tokens num: %d" % len(all_doc_ids)) + print("Average tokens per sentence: %.2f" % (len(all_doc_ids) / len(lens))) + print("Average tokens per document: %.2f" % (len(all_doc_ids) / + (len(docs) - 1))) + + +if __name__ == "__main__": + main() diff --git a/application/neural_search/recall/domain_adaptive_pretraining/data_tools/dataset_utils.py b/application/neural_search/recall/domain_adaptive_pretraining/data_tools/dataset_utils.py new file mode 100644 index 000000000000..7e6655229713 --- /dev/null +++ b/application/neural_search/recall/domain_adaptive_pretraining/data_tools/dataset_utils.py @@ -0,0 +1,775 @@ +# coding=utf-8 +# Copyright 2018 The Google AI Language Team Authors, and NVIDIA, and PaddlePaddle Authors. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Most of the code here has been copied from: +# https://github.com/google-research/albert/blob/master/create_pretraining_data.py +# with some modifications. + +import math +import os +import re +import time +import collections + +import numpy as np +import paddle +import paddle.distributed.fleet as fleet + +print_rank_0 = print + +#from megatron.data.indexed_dataset import make_dataset as make_indexed_dataset +COMPILED = False +DSET_TYPE_BERT = 'standard_bert' +DSET_TYPE_T5 = 't5' +DSET_TYPE_ERNIE = 'ernie' + +DSET_TYPES = [DSET_TYPE_BERT, DSET_TYPE_T5, DSET_TYPE_ERNIE] + + +class MMapIndexedDataset(paddle.io.Dataset): + def __init__(self, path, skip_warmup=False): + super().__init__() + + self._path = path + + # All documment ids, extend as 1-D array. + + for suffix in ["_ids.npy", "_idx.npz"]: + if not os.path.isfile(path + suffix): + raise ValueError("File Not found, %s" % (path + suffix)) + + self._token_ids = np.load( + path + "_ids.npy", mmap_mode="r", allow_pickle=True) + process_datas = np.load(path + "_idx.npz") + self._sizes = process_datas["lens"] + self._pointers = process_datas["sents"] + self._doc_idx = process_datas["docs"] + + def __getstate__(self): + return self._path + + def __len__(self): + return len(self._sizes) + + # @lru_cache(maxsize=8) + def __getitem__(self, idx): + if isinstance(idx, int): + size = self._sizes[idx] + ptr = self._pointers[idx] + np_array = self._token_ids[ptr:ptr + size] + return np_array + + elif isinstance(idx, slice): + start, stop, step = idx.indices(len(self)) + if step != 1: + raise ValueError( + "Slices into indexed_dataset must be contiguous") + ptr = self._pointers[start] + sizes = self._sizes[idx] + offsets = list(accumulate(sizes)) + total_size = sum(sizes) + np_array = self._token_ids[ptr:ptr + total_size] + sents = np.split(np_array, offsets[:-1]) + return sents + + def get(self, idx, offset=0, length=None): + """ Retrieves a single item from the dataset with the option to only + return a portion of the item. + + get(idx) is the same as [idx] but get() does not support slicing. + """ + size = self._sizes[idx] + ptr = self._pointers[idx] + + if length is None: + length = size - offset + ptr += offset + np_array = self._token_ids[ptr:prt + length] + return np_array + + @property + def sizes(self): + return self._sizes + + @property + def doc_idx(self): + return self._doc_idx + + def get_doc_idx(self): + return self._doc_idx + + def set_doc_idx(self, doc_idx_): + self._doc_idx = doc_idx_ + + +def make_indexed_dataset(data_prefix, data_impl=None, skip_warmup=False): + return MMapIndexedDataset(data_prefix) + + +def compile_helper(): + """Compile helper function ar runtime. Make sure this + is invoked on a single process.""" + import os + import subprocess + path = os.path.abspath(os.path.dirname(__file__)) + ret = subprocess.run(['make', '-C', path]) + if ret.returncode != 0: + print("Making C++ dataset helpers module failed, exiting.") + import sys + sys.exit(1) + + +def get_a_and_b_segments(sample, np_rng): + """Divide sample into a and b segments.""" + + # Number of sentences in the sample. 
+ n_sentences = len(sample) + # Make sure we always have two sentences. + assert n_sentences > 1, 'make sure each sample has at least two sentences.' + + # First part: + # `a_end` is how many sentences go into the `A`. + a_end = 1 + if n_sentences >= 3: + # Note that randin in numpy is exclusive. + a_end = np_rng.randint(1, n_sentences) + tokens_a = [] + for j in range(a_end): + tokens_a.extend(sample[j]) + + # Second part: + tokens_b = [] + for j in range(a_end, n_sentences): + tokens_b.extend(sample[j]) + + # Random next: + is_next_random = False + if np_rng.random() < 0.5: + is_next_random = True + tokens_a, tokens_b = tokens_b, tokens_a + + return tokens_a, tokens_b, is_next_random + + +def truncate_segments(tokens_a, tokens_b, len_a, len_b, max_num_tokens, np_rng): + """Truncates a pair of sequences to a maximum sequence length.""" + #print(len_a, len_b, max_num_tokens) + assert len_a > 0 + if len_a + len_b <= max_num_tokens: + return False + while len_a + len_b > max_num_tokens: + if len_a > len_b: + len_a -= 1 + tokens = tokens_a + else: + len_b -= 1 + tokens = tokens_b + if np_rng.random() < 0.5: + del tokens[0] + else: + tokens.pop() + return True + + +def create_tokens_and_tokentypes(tokens_a, tokens_b, cls_id, sep_id): + """Merge segments A and B, add [CLS] and [SEP] and build tokentypes.""" + + tokens = [] + tokentypes = [] + # [CLS]. + tokens.append(cls_id) + tokentypes.append(0) + # Segment A. + for token in tokens_a: + tokens.append(token) + tokentypes.append(0) + # [SEP]. + tokens.append(sep_id) + tokentypes.append(0) + # Segment B. + for token in tokens_b: + tokens.append(token) + tokentypes.append(1) + if tokens_b: + # [SEP]. + tokens.append(sep_id) + tokentypes.append(1) + + return tokens, tokentypes + + +MaskedLmInstance = collections.namedtuple("MaskedLmInstance", + ["index", "label"]) + + +def is_start_piece(piece): + """Check if the current word piece is the starting piece (BERT).""" + # When a word has been split into + # WordPieces, the first token does not have any marker and any subsequence + # tokens are prefixed with ##. So whenever we see the ## token, we + # append it to the previous set of word indexes. + return not piece.startswith("##") + + +def create_masked_lm_predictions(tokens, + vocab_id_list, + vocab_id_to_token_dict, + masked_lm_prob, + cls_id, + sep_id, + mask_id, + max_predictions_per_seq, + np_rng, + max_ngrams=3, + vocab_token_to_id_dict=None, + do_whole_word_mask=True, + favor_longer_ngram=False, + do_permutation=False, + geometric_dist=False, + to_chinese_char=False, + inplace_random_mask=False, + masking_style="bert"): + """Creates the predictions for the masked LM objective. + Note: Tokens here are vocab ids and not text tokens.""" + + cand_indexes = [] + # Note(mingdachen): We create a list for recording if the piece is + # the starting piece of current token, where 1 means true, so that + # on-the-fly whole word masking is possible. + token_boundary = [0] * len(tokens) + + for (i, token) in enumerate(tokens): + if token == cls_id or token == sep_id: + token_boundary[i] = 1 + continue + # Whole Word Masking means that if we mask all of the wordpieces + # corresponding to an original word. + # + # Note that Whole Word Masking does *not* change the training code + # at all -- we still predict each WordPiece independently, softmaxed + # over the entire vocabulary. 
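+        # Pieces prefixed with "##" are appended to the previous candidate entry, so a
+        # whole word (all of its WordPieces) can later be masked as one unit.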
+ vocab_id = vocab_id_to_token_dict[token] + if (do_whole_word_mask and len(cand_indexes) >= 1 and + not is_start_piece(vocab_id)): + cand_indexes[-1].append(i) + else: + cand_indexes.append([i]) + if is_start_piece(vocab_id_to_token_dict[token]): + token_boundary[i] = 1 + + if to_chinese_char: + char_tokens = [] + assert vocab_token_to_id_dict is not None + for i, b in enumerate(token_boundary): + if b == 0: + vocab_id = vocab_id_to_token_dict[tokens[i]] + new_vocab_id = vocab_id[2:] if len( + re.findall('##[\u4E00-\u9FA5]', vocab_id)) > 0 else vocab_id + char_tokens.append(vocab_token_to_id_dict[new_vocab_id] + if new_vocab_id in vocab_token_to_id_dict + else token) + else: + char_tokens.append(tokens[i]) + output_tokens = list(char_tokens) + else: + output_tokens = list(tokens) + + masked_lm_positions = [] + masked_lm_labels = [] + + if masked_lm_prob == 0: + return (output_tokens, masked_lm_positions, masked_lm_labels, + token_boundary) + + num_to_predict = min(max_predictions_per_seq, + max(1, int(round(len(tokens) * masked_lm_prob)))) + + ngrams = np.arange(1, max_ngrams + 1, dtype=np.int64) + if not geometric_dist: + # Note(mingdachen): + # By default, we set the probilities to favor shorter ngram sequences. + pvals = 1. / np.arange(1, max_ngrams + 1) + pvals /= pvals.sum(keepdims=True) + if favor_longer_ngram: + pvals = pvals[::-1] + + ngram_indexes = [] + for idx in range(len(cand_indexes)): + ngram_index = [] + for n in ngrams: + ngram_index.append(cand_indexes[idx:idx + n]) + ngram_indexes.append(ngram_index) + + np_rng.shuffle(ngram_indexes) + + (masked_lms, masked_spans) = ([], []) + covered_indexes = set() + backup_output_tokens = list(output_tokens) + for cand_index_set in ngram_indexes: + if len(masked_lms) >= num_to_predict: + break + if not cand_index_set: + continue + # Note(mingdachen): + # Skip current piece if they are covered in lm masking or previous ngrams. + for index_set in cand_index_set[0]: + for index in index_set: + if index in covered_indexes: + continue + + if not geometric_dist: + n = np_rng.choice( + ngrams[:len(cand_index_set)], + p=pvals[:len(cand_index_set)] / + pvals[:len(cand_index_set)].sum(keepdims=True)) + else: + # Sampling "n" from the geometric distribution and clipping it to + # the max_ngrams. Using p=0.2 default from the SpanBERT paper + # https://arxiv.org/pdf/1907.10529.pdf (Sec 3.1) + n = min(np_rng.geometric(0.2), max_ngrams) + + index_set = sum(cand_index_set[n - 1], []) + n -= 1 + # Note(mingdachen): + # Repeatedly looking for a candidate that does not exceed the + # maximum number of predictions by trying shorter ngrams. + while len(masked_lms) + len(index_set) > num_to_predict: + if n == 0: + break + index_set = sum(cand_index_set[n - 1], []) + n -= 1 + # If adding a whole-word mask would exceed the maximum number of + # predictions, then just skip this candidate. 
+ if len(masked_lms) + len(index_set) > num_to_predict: + continue + is_any_index_covered = False + for index in index_set: + if index in covered_indexes: + is_any_index_covered = True + break + if is_any_index_covered: + continue + for index in index_set: + covered_indexes.add(index) + masked_token = None + if masking_style == "bert": + # 80% of the time, replace with [MASK] + if np_rng.random() < 0.8: + masked_token = mask_id + else: + # 10% of the time, keep original + if np_rng.random() < 0.5: + masked_token = output_tokens[index] + # 10% of the time, replace with random word + else: + if inplace_random_mask: + masked_token = backup_output_tokens[np_rng.randint( + 0, len(output_tokens))] + else: + masked_token = vocab_id_list[np_rng.randint( + 0, len(vocab_id_list))] + elif masking_style == "t5": + masked_token = mask_id + else: + raise ValueError("invalid value of masking style") + + output_tokens[index] = masked_token + masked_lms.append( + MaskedLmInstance( + index=index, label=tokens[index])) + + masked_spans.append( + MaskedLmInstance( + index=index_set, label=[tokens[index] for index in index_set])) + + assert len(masked_lms) <= num_to_predict + np_rng.shuffle(ngram_indexes) + + select_indexes = set() + if do_permutation: + for cand_index_set in ngram_indexes: + if len(select_indexes) >= num_to_predict: + break + if not cand_index_set: + continue + # Note(mingdachen): + # Skip current piece if they are covered in lm masking or previous ngrams. + for index_set in cand_index_set[0]: + for index in index_set: + if index in covered_indexes or index in select_indexes: + continue + + n = np.random.choice( + ngrams[:len(cand_index_set)], + p=pvals[:len(cand_index_set)] / + pvals[:len(cand_index_set)].sum(keepdims=True)) + index_set = sum(cand_index_set[n - 1], []) + n -= 1 + + while len(select_indexes) + len(index_set) > num_to_predict: + if n == 0: + break + index_set = sum(cand_index_set[n - 1], []) + n -= 1 + # If adding a whole-word mask would exceed the maximum number of + # predictions, then just skip this candidate. + if len(select_indexes) + len(index_set) > num_to_predict: + continue + is_any_index_covered = False + for index in index_set: + if index in covered_indexes or index in select_indexes: + is_any_index_covered = True + break + if is_any_index_covered: + continue + for index in index_set: + select_indexes.add(index) + assert len(select_indexes) <= num_to_predict + + select_indexes = sorted(select_indexes) + permute_indexes = list(select_indexes) + np_rng.shuffle(permute_indexes) + orig_token = list(output_tokens) + + for src_i, tgt_i in zip(select_indexes, permute_indexes): + output_tokens[src_i] = orig_token[tgt_i] + masked_lms.append( + MaskedLmInstance( + index=src_i, label=orig_token[src_i])) + + masked_lms = sorted(masked_lms, key=lambda x: x.index) + # Sort the spans by the index of the first span + masked_spans = sorted(masked_spans, key=lambda x: x.index[0]) + + for p in masked_lms: + masked_lm_positions.append(p.index) + masked_lm_labels.append(p.label) + return (output_tokens, masked_lm_positions, masked_lm_labels, + token_boundary, masked_spans) + + +def pad_and_convert_to_numpy(tokens, tokentypes, masked_positions, + masked_labels, pad_id, max_seq_length): + """Pad sequences and convert them to numpy.""" + + # Some checks. + num_tokens = len(tokens) + padding_length = max_seq_length - num_tokens + assert padding_length >= 0 + assert len(tokentypes) == num_tokens + assert len(masked_positions) == len(masked_labels) + + # Tokens and token types. 
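+    # Right-pad both token ids and token type ids with pad_id up to max_seq_length.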
+ filler = [pad_id] * padding_length + tokens_np = np.array(tokens + filler, dtype=np.int64) + tokentypes_np = np.array(tokentypes + filler, dtype=np.int64) + + # Padding mask. + padding_mask_np = np.array( + [1] * num_tokens + [0] * padding_length, dtype=np.int64) + + # Lables and loss mask. + labels = [-1] * max_seq_length + loss_mask = [0] * max_seq_length + for i in range(len(masked_positions)): + assert masked_positions[i] < num_tokens + labels[masked_positions[i]] = masked_labels[i] + loss_mask[masked_positions[i]] = 1 + labels_np = np.array(labels, dtype=np.int64) + loss_mask_np = np.array(loss_mask, dtype=np.int64) + + return tokens_np, tokentypes_np, labels_np, padding_mask_np, loss_mask_np + + +def build_train_valid_test_datasets(data_prefix, + args, + tokenizer, + splits_string, + train_valid_test_num_samples, + max_seq_length, + masked_lm_prob, + short_seq_prob, + seed, + skip_warmup, + binary_head=False, + max_seq_length_dec=None, + dataset_type='standard_bert'): + + if len(data_prefix) == 1: + return _build_train_valid_test_datasets( + data_prefix[0], + args, + tokenizer, + splits_string, + train_valid_test_num_samples, + max_seq_length, + masked_lm_prob, + short_seq_prob, + seed, + skip_warmup, + binary_head, + max_seq_length_dec, + dataset_type=dataset_type) + + +def _build_train_valid_test_datasets(data_prefix, + args, + tokenizer, + splits_string, + train_valid_test_num_samples, + max_seq_length, + masked_lm_prob, + short_seq_prob, + seed, + skip_warmup, + binary_head, + max_seq_length_dec, + dataset_type='standard_bert'): + + if dataset_type not in DSET_TYPES: + raise ValueError("Invalid dataset_type: ", dataset_type) + + # Indexed dataset. + indexed_dataset = get_indexed_dataset_(data_prefix, None, skip_warmup) + + # Get start and end indices of train/valid/train into doc-idx + # Note that doc-idx is desinged to be num-docs + 1 so we can + # easily iterate over it. + total_num_of_documents = indexed_dataset.doc_idx.shape[0] - 1 + splits = get_train_valid_test_split_(splits_string, total_num_of_documents) + print(splits) + # Print stats about the splits. + print_rank_0(' > dataset split:') + + def print_split_stats(name, index): + print_rank_0(' {}:'.format(name)) + print_rank_0(' document indices in [{}, {}) total of {} ' + 'documents'.format(splits[index], splits[index + 1], + splits[index + 1] - splits[index])) + start_index = indexed_dataset.doc_idx[splits[index]] + end_index = indexed_dataset.doc_idx[splits[index + 1]] + print_rank_0(' sentence indices in [{}, {}) total of {} ' + 'sentences'.format(start_index, end_index, end_index - + start_index)) + + print_split_stats('train', 0) + print_split_stats('validation', 1) + print_split_stats('test', 2) + + def build_dataset(index, name): + # from megatron.data.bert_dataset import BertDataset + # from megatron.data.t5_dataset import T5Dataset + from .ernie_dataset import ErnieDataset + dataset = None + if splits[index + 1] > splits[index]: + # Get the pointer to the original doc-idx so we can set it later. + doc_idx_ptr = indexed_dataset.get_doc_idx() + # Slice the doc-idx + start_index = splits[index] + # Add +1 so we can index into the dataset to get the upper bound. + end_index = splits[index + 1] + 1 + # New doc_idx view. + indexed_dataset.set_doc_idx(doc_idx_ptr[start_index:end_index]) + # Build the dataset accordingly. 
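+            # Constructor arguments shared by the BERT/T5/ERNIE dataset classes below.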
+ kwargs = dict( + name=name, + data_prefix=data_prefix, + num_epochs=None, + max_num_samples=train_valid_test_num_samples[index], + max_seq_length=max_seq_length, + seed=seed, + share_folder=args.share_folder, ) + if dataset_type == DSET_TYPE_T5: + dataset = T5Dataset( + indexed_dataset=indexed_dataset, + tokenizer=tokenizer, + masked_lm_prob=masked_lm_prob, + max_seq_length_dec=max_seq_length_dec, + short_seq_prob=short_seq_prob, + **kwargs) + elif dataset_type == DSET_TYPE_BERT: + dataset = BertDataset( + indexed_dataset=indexed_dataset, + tokenizer=tokenizer, + masked_lm_prob=masked_lm_prob, + short_seq_prob=short_seq_prob, + binary_head=binary_head, + **kwargs) + elif dataset_type == DSET_TYPE_ERNIE: + dataset = ErnieDataset( + indexed_dataset=indexed_dataset, + tokenizer=tokenizer, #ErnieTokenizer.from_pretrained("ernie-1.0"), + masked_lm_prob=masked_lm_prob, + short_seq_prob=short_seq_prob, + binary_head=binary_head, + **kwargs) + else: + raise NotImplementedError("Dataset type not fully implemented.") + + # Set the original pointer so dataset remains the main dataset. + indexed_dataset.set_doc_idx(doc_idx_ptr) + # Checks. + assert indexed_dataset.doc_idx[0] == 0 + assert indexed_dataset.doc_idx.shape[0] == \ + (total_num_of_documents + 1) + return dataset + + train_dataset = build_dataset(0, 'train') + valid_dataset = build_dataset(1, 'valid') + test_dataset = build_dataset(2, 'test') + + return (train_dataset, valid_dataset, test_dataset) + + +def get_indexed_dataset_(data_prefix, data_impl, skip_warmup): + + print_rank_0(' > building dataset index ...') + + start_time = time.time() + indexed_dataset = make_indexed_dataset(data_prefix, data_impl, skip_warmup) + assert indexed_dataset.sizes.shape[0] == indexed_dataset.doc_idx[-1] + print_rank_0(' > finished creating indexed dataset in {:4f} ' + 'seconds'.format(time.time() - start_time)) + + print_rank_0(' > indexed dataset stats:') + print_rank_0(' number of documents: {}'.format( + indexed_dataset.doc_idx.shape[0] - 1)) + print_rank_0(' number of sentences: {}'.format( + indexed_dataset.sizes.shape[0])) + + return indexed_dataset + + +def get_train_valid_test_split_(splits_string, size): + """ Get dataset splits from comma or '/' separated string list.""" + print(splits_string) + splits = [] + if splits_string.find(',') != -1: + splits = [float(s) for s in splits_string.split(',')] + elif splits_string.find('/') != -1: + splits = [float(s) for s in splits_string.split('/')] + else: + splits = [float(splits_string)] + while len(splits) < 3: + splits.append(0.) 
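+    # Keep at most three weights, normalize them, and convert them into cumulative
+    # document-index boundaries that exactly cover `size` documents.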
+ splits = splits[:3] + splits_sum = sum(splits) + assert splits_sum > 0.0 + splits = [split / splits_sum for split in splits] + splits_index = [0] + for index, split in enumerate(splits): + splits_index.append(splits_index[index] + int( + round(split * float(size)))) + diff = splits_index[-1] - size + for index in range(1, len(splits_index)): + splits_index[index] -= diff + assert len(splits_index) == 4 + assert splits_index[-1] == size + return splits_index + + +def get_samples_mapping(indexed_dataset, data_prefix, num_epochs, + max_num_samples, max_seq_length, short_seq_prob, seed, + name, binary_head, share_folder): + """Get a list that maps a sample index to a starting sentence index, end sentence index, and length""" + + if not num_epochs: + if not max_num_samples: + raise ValueError("Need to specify either max_num_samples " + "or num_epochs") + num_epochs = np.iinfo(np.int32).max - 1 + if not max_num_samples: + max_num_samples = np.iinfo(np.int64).max - 1 + + # Filename of the index mapping + indexmap_filename = data_prefix + indexmap_filename += '_{}_indexmap'.format(name) + if num_epochs != (np.iinfo(np.int32).max - 1): + indexmap_filename += '_{}ep'.format(num_epochs) + if max_num_samples != (np.iinfo(np.int64).max - 1): + indexmap_filename += '_{}mns'.format(max_num_samples) + indexmap_filename += '_{}msl'.format(max_seq_length) + indexmap_filename += '_{:0.2f}ssp'.format(short_seq_prob) + indexmap_filename += '_{}s'.format(seed) + indexmap_filename += '.npy' + + local_rank = 0 if fleet.local_rank() is None else int(fleet.local_rank()) + if share_folder: + local_rank = fleet.worker_index() + # Build the indexed mapping if not exist. + + if local_rank == 0 and \ + not os.path.isfile(indexmap_filename): + print(' > WARNING: could not find index map file {}, building ' + 'the indices on rank 0 ...'.format(indexmap_filename)) + + # Make sure the types match the helpers input types. + assert indexed_dataset.doc_idx.dtype == np.int64 + print(indexed_dataset.sizes.dtype) + assert indexed_dataset.sizes.dtype == np.int32 + + # Build samples mapping + verbose = local_rank == 0 + start_time = time.time() + print_rank_0(' > building sapmles index mapping for {} ...'.format( + name)) + # First compile and then import. + if local_rank == 0: + compile_helper() + import data_tools.helpers as helpers + samples_mapping = helpers.build_mapping( + indexed_dataset.doc_idx, indexed_dataset.sizes, num_epochs, + max_num_samples, max_seq_length, short_seq_prob, seed, verbose, 2 + if binary_head else 1) + print_rank_0(' > done building sapmles index maping') + np.save(indexmap_filename, samples_mapping, allow_pickle=True) + print_rank_0(' > saved the index mapping in {}'.format( + indexmap_filename)) + # Make sure all the ranks have built the mapping + print_rank_0(' > elasped time to build and save samples mapping ' + '(seconds): {:4f}'.format(time.time() - start_time)) + + else: + while True: + if (not os.path.isfile(indexmap_filename)): + time.sleep(3) + else: + try: + np.load(indexmap_filename, allow_pickle=True, mmap_mode='r') + break + except Exception as e: + print( + "%s file is still writing or damaged, please wait a moment." + % indexmap_filename) + time.sleep(3) + + # This should be a barrier but nccl barrier assumes + # device_index=rank which is not the case for model + # parallel case + if paddle.distributed.get_world_size() > 1: + if paddle.fluid.framework.in_dygraph_mode(): + paddle.distributed.barrier() + + # Load indexed dataset. 
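+    # Every rank memory-maps the samples mapping file that rank 0 built (or found) above.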
+ print_rank_0(' > loading indexed mapping from {}'.format(indexmap_filename)) + start_time = time.time() + samples_mapping = np.load( + indexmap_filename, allow_pickle=True, mmap_mode='r') + print_rank_0(' loaded indexed file in {:3.3f} seconds'.format(time.time( + ) - start_time)) + print_rank_0(' total number of samples: {}'.format(samples_mapping.shape[ + 0])) + + return samples_mapping diff --git a/application/neural_search/recall/domain_adaptive_pretraining/data_tools/ernie_dataset.py b/application/neural_search/recall/domain_adaptive_pretraining/data_tools/ernie_dataset.py new file mode 100644 index 000000000000..ac03102ca4f3 --- /dev/null +++ b/application/neural_search/recall/domain_adaptive_pretraining/data_tools/ernie_dataset.py @@ -0,0 +1,217 @@ +# Copyright (c) 2021, PadddlePaddle authors. All rights reserved. +# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""BERT Style dataset.""" + +import numpy as np +import paddle +import re + +from .dataset_utils import ( + get_samples_mapping, + get_a_and_b_segments, + truncate_segments, + create_tokens_and_tokentypes, + create_masked_lm_predictions, + make_indexed_dataset, + get_indexed_dataset_, ) + +from paddlenlp.transformers import ErnieTokenizer + + +class ErnieDataset(paddle.io.Dataset): + def __init__(self, + name, + tokenizer, + indexed_dataset, + data_prefix, + num_epochs, + max_num_samples, + masked_lm_prob, + max_seq_length, + short_seq_prob, + seed, + binary_head, + share_folder=False): + + # Params to store. + self.name = name + self.seed = seed + self.masked_lm_prob = masked_lm_prob + self.max_seq_length = max_seq_length + self.binary_head = binary_head + self.share_folder = share_folder + + # Dataset. + self.indexed_dataset = indexed_dataset + + # Build the samples mapping. + self.samples_mapping = get_samples_mapping( + self.indexed_dataset, + data_prefix, + num_epochs, + max_num_samples, + self.max_seq_length - 3, # account for added tokens + short_seq_prob, + self.seed, + self.name, + self.binary_head, + self.share_folder) + + # Vocab stuff. + # tokenizer = get_tokenizer() + # self.vocab_id_list = list(tokenizer.inv_vocab.keys()) + # self.vocab_id_to_token_dict = tokenizer.inv_vocab + self.vocab_id_list = list(tokenizer.vocab.idx_to_token.keys()) + self.vocab_id_to_token_dict = tokenizer.vocab.idx_to_token + self.vocab_token_to_id_dict = tokenizer.vocab.token_to_idx + + self.cls_id = tokenizer.cls_token_id + self.sep_id = tokenizer.sep_token_id + self.mask_id = tokenizer.mask_token_id + self.pad_id = tokenizer.pad_token_id + + def __len__(self): + return self.samples_mapping.shape[0] + + def __getitem__(self, idx): + start_idx, end_idx, seq_length = self.samples_mapping[idx] + sample = [self.indexed_dataset[i] for i in range(start_idx, end_idx)] + # Note that this rng state should be numpy and not python since + # python randint is inclusive whereas the numpy one is exclusive. 
+ # We % 2**32 since numpy requres the seed to be between 0 and 2**32 - 1 + np_rng = np.random.RandomState(seed=((self.seed + idx) % 2**32)) + return build_training_sample( + sample, + seq_length, + self.max_seq_length, # needed for padding + self.vocab_id_list, + self.vocab_id_to_token_dict, + self.vocab_token_to_id_dict, + self.cls_id, + self.sep_id, + self.mask_id, + self.pad_id, + self.masked_lm_prob, + np_rng, + self.binary_head) + + +def build_training_sample(sample, target_seq_length, max_seq_length, + vocab_id_list, vocab_id_to_token_dict, + vocab_token_to_id_dict, cls_id, sep_id, mask_id, + pad_id, masked_lm_prob, np_rng, binary_head): + """Biuld training sample. + + Arguments: + sample: A list of sentences in which each sentence is a list token ids. + target_seq_length: Desired sequence length. + max_seq_length: Maximum length of the sequence. All values are padded to + this length. + vocab_id_list: List of vocabulary ids. Used to pick a random id. + vocab_id_to_token_dict: A dictionary from vocab ids to text tokens. + vocab_token_to_id_dict: A dictionary from text tokens to vocab ids. + cls_id: Start of example id. + sep_id: Separator id. + mask_id: Mask token id. + pad_id: Padding token id. + masked_lm_prob: Probability to mask tokens. + np_rng: Random number genenrator. Note that this rng state should be + numpy and not python since python randint is inclusive for + the opper bound whereas the numpy one is exclusive. + """ + + if binary_head: + # We assume that we have at least two sentences in the sample + assert len(sample) > 1, "The sentence num should be large than 1." + assert target_seq_length <= max_seq_length + + # Divide sample into two segments (A and B). + if binary_head: + tokens_a, tokens_b, is_next_random = get_a_and_b_segments(sample, + np_rng) + else: + tokens_a = [] + for j in range(len(sample)): + tokens_a.extend(sample[j]) + tokens_b = [] + is_next_random = False + + # Truncate to `target_sequence_length`. + max_num_tokens = target_seq_length + truncated = truncate_segments(tokens_a, tokens_b, + len(tokens_a), + len(tokens_b), max_num_tokens, np_rng) + + # Build tokens and toketypes. + tokens, tokentypes = create_tokens_and_tokentypes(tokens_a, tokens_b, + cls_id, sep_id) + + # Masking. + max_predictions_per_seq = masked_lm_prob * max_num_tokens + (tokens, masked_positions, masked_labels, _, + _) = create_masked_lm_predictions( + tokens, + vocab_id_list, + vocab_id_to_token_dict, + masked_lm_prob, + cls_id, + sep_id, + mask_id, + max_predictions_per_seq, + np_rng, + vocab_token_to_id_dict=vocab_token_to_id_dict, + to_chinese_char=True, + inplace_random_mask=False, ) + + # Padding. + tokens_np, tokentypes_np, labels_np, padding_mask_np, loss_mask_np \ + = pad_and_convert_to_numpy(tokens, tokentypes, masked_positions, + masked_labels, pad_id, max_seq_length) + + return tokens_np, tokentypes_np, padding_mask_np, masked_positions, masked_labels, int( + is_next_random) + + +def pad_and_convert_to_numpy(tokens, tokentypes, masked_positions, + masked_labels, pad_id, max_seq_length): + """Pad sequences and convert them to numpy.""" + + # Some checks. + num_tokens = len(tokens) + padding_length = max_seq_length - num_tokens + assert padding_length >= 0 + assert len(tokentypes) == num_tokens + assert len(masked_positions) == len(masked_labels) + + # Tokens and token types. + filler = [pad_id] * padding_length + tokens_np = np.array(tokens + filler, dtype=np.int64) + tokentypes_np = np.array(tokentypes + filler, dtype=np.int64) + + # Padding mask. 
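+    # 1.0 marks real tokens and 0.0 marks padding; the [1, 1, seq_len] reshape
+    # lets the mask broadcast over the attention heads.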
+ padding_mask_np = np.array( + [1] * num_tokens + [0] * padding_length, dtype=np.float32) + padding_mask_np = padding_mask_np.reshape([1, 1, -1]) + # Lables and loss mask. + labels = [-1] * max_seq_length + loss_mask = [0] * max_seq_length + for i in range(len(masked_positions)): + assert masked_positions[i] < num_tokens + labels[masked_positions[i]] = masked_labels[i] + loss_mask[masked_positions[i]] = 1 + labels_np = np.array(labels, dtype=np.int64) + loss_mask_np = np.array(loss_mask, dtype=np.int64) + + return tokens_np, tokentypes_np, labels_np, padding_mask_np, loss_mask_np diff --git a/application/neural_search/recall/domain_adaptive_pretraining/data_tools/helpers.cpp b/application/neural_search/recall/domain_adaptive_pretraining/data_tools/helpers.cpp new file mode 100644 index 000000000000..1b7c9b5e50d9 --- /dev/null +++ b/application/neural_search/recall/domain_adaptive_pretraining/data_tools/helpers.cpp @@ -0,0 +1,736 @@ +/* + coding=utf-8 + Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved. + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. + */ + + +/* Helper methods for fast index mapping builds */ + +#include +#include +#include +#include +#include +#include +#include +#include + +namespace py = pybind11; +using namespace std; + +const int32_t LONG_SENTENCE_LEN = 512; + + +void build_blending_indices(py::array_t& dataset_index, + py::array_t& dataset_sample_index, + const py::array_t& weights, + const int32_t num_datasets, + const int64_t size, + const bool verbose) { + /* Given multiple datasets and a weighting array, build samples + such that it follows those wieghts.*/ + + if (verbose) { + std::cout << "> building indices for blendable datasets ..." << std::endl; + } + + // Get the pointer access without the checks. + auto dataset_index_ptr = dataset_index.mutable_unchecked<1>(); + auto dataset_sample_index_ptr = dataset_sample_index.mutable_unchecked<1>(); + auto weights_ptr = weights.unchecked<1>(); + + // Initialize buffer for number of samples used for each dataset. + int64_t current_samples[num_datasets]; + for (int64_t i = 0; i < num_datasets; ++i) { + current_samples[i] = 0; + } + + // For each sample: + for (int64_t sample_idx = 0; sample_idx < size; ++sample_idx) { + // Determine where the max error in sampling is happening. + auto sample_idx_double = std::max(static_cast(sample_idx), 1.0); + int64_t max_error_index = 0; + double max_error = weights_ptr[0] * sample_idx_double - + static_cast(current_samples[0]); + for (int64_t dataset_idx = 1; dataset_idx < num_datasets; ++dataset_idx) { + double error = weights_ptr[dataset_idx] * sample_idx_double - + static_cast(current_samples[dataset_idx]); + if (error > max_error) { + max_error = error; + max_error_index = dataset_idx; + } + } + + // Populate the indices. + dataset_index_ptr[sample_idx] = static_cast(max_error_index); + dataset_sample_index_ptr[sample_idx] = current_samples[max_error_index]; + + // Update the total samples. 
+ current_samples[max_error_index] += 1; + } + + // print info + if (verbose) { + std::cout << " > sample ratios:" << std::endl; + for (int64_t dataset_idx = 0; dataset_idx < num_datasets; ++dataset_idx) { + auto ratio = static_cast(current_samples[dataset_idx]) / + static_cast(size); + std::cout << " dataset " << dataset_idx + << ", input: " << weights_ptr[dataset_idx] + << ", achieved: " << ratio << std::endl; + } + } +} + + +py::array build_sample_idx(const py::array_t& sizes_, + const py::array_t& doc_idx_, + const int32_t seq_length, + const int32_t num_epochs, + const int64_t tokens_per_epoch) { + /* Sample index (sample_idx) is used for gpt2 like dataset for which + the documents are flattened and the samples are built based on this + 1-D flatten array. It is a 2D array with sizes [number-of-samples + 1, 2] + where [..., 0] contains the index into `doc_idx` and [..., 1] is the + starting offset in that document.*/ + + // Consistency checks. + assert(seq_length > 1); + assert(num_epochs > 0); + assert(tokens_per_epoch > 1); + + // Remove bound checks. + auto sizes = sizes_.unchecked<1>(); + auto doc_idx = doc_idx_.unchecked<1>(); + + // Mapping and it's length (1D). + int64_t num_samples = (num_epochs * tokens_per_epoch - 1) / seq_length; + int32_t* sample_idx = new int32_t[2 * (num_samples + 1)]; + + cout << " using:" << endl << std::flush; + cout << " number of documents: " << doc_idx_.shape(0) / num_epochs + << endl + << std::flush; + cout << " number of epochs: " << num_epochs << endl + << std::flush; + cout << " sequence length: " << seq_length << endl + << std::flush; + cout << " total number of samples: " << num_samples << endl + << std::flush; + + // Index into sample_idx. + int64_t sample_index = 0; + // Index into doc_idx. + int64_t doc_idx_index = 0; + // Begining offset for each document. + int32_t doc_offset = 0; + // Start with first document and no offset. + sample_idx[2 * sample_index] = doc_idx_index; + sample_idx[2 * sample_index + 1] = doc_offset; + ++sample_index; + + while (sample_index <= num_samples) { + // Start with a fresh sequence. + int32_t remaining_seq_length = seq_length + 1; + while (remaining_seq_length != 0) { + // Get the document length. + auto doc_id = doc_idx[doc_idx_index]; + auto doc_length = sizes[doc_id] - doc_offset; + // And add it to the current sequence. + remaining_seq_length -= doc_length; + // If we have more than a full sequence, adjust offset and set + // remaining length to zero so we return from the while loop. + // Note that -1 here is for the same reason we have -1 in + // `_num_epochs` calculations. + if (remaining_seq_length <= 0) { + doc_offset += (remaining_seq_length + doc_length - 1); + remaining_seq_length = 0; + } else { + // Otherwise, start from the begining of the next document. + ++doc_idx_index; + doc_offset = 0; + } + } + // Record the sequence. + sample_idx[2 * sample_index] = doc_idx_index; + sample_idx[2 * sample_index + 1] = doc_offset; + ++sample_index; + } + + // Method to deallocate memory. + py::capsule free_when_done(sample_idx, [](void* mem_) { + int32_t* mem = reinterpret_cast(mem_); + delete[] mem; + }); + + // Return the numpy array. 
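+  // Shape is [num_samples + 1, 2]: column 0 indexes into doc_idx and
+  // column 1 is the starting token offset inside that document.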
+ const auto byte_size = sizeof(int32_t); + return py::array(std::vector{num_samples + 1, 2}, // shape + {2 * byte_size, byte_size}, // C-style contiguous strides + sample_idx, // the data pointer + free_when_done); // numpy array references +} + + +inline int32_t get_target_sample_len(const int32_t short_seq_ratio, + const int32_t max_length, + std::mt19937& rand32_gen) { + /* Training sample length. */ + if (short_seq_ratio == 0) { + return max_length; + } + const auto random_number = rand32_gen(); + if ((random_number % short_seq_ratio) == 0) { + return 2 + random_number % (max_length - 1); + } + return max_length; +} + + +template +py::array build_mapping_impl(const py::array_t& docs_, + const py::array_t& sizes_, + const int32_t num_epochs, + const uint64_t max_num_samples, + const int32_t max_seq_length, + const double short_seq_prob, + const int32_t seed, + const bool verbose, + const int32_t min_num_sent) { + /* Build a mapping of (start-index, end-index, sequence-length) where + start and end index are the indices of the sentences in the sample + and sequence-length is the target sequence length. + */ + + // Consistency checks. + assert(num_epochs > 0); + assert(max_seq_length > 1); + assert(short_seq_prob >= 0.0); + assert(short_seq_prob <= 1.0); + assert(seed > 0); + + // Remove bound checks. + auto docs = docs_.unchecked<1>(); + auto sizes = sizes_.unchecked<1>(); + + // For efficiency, convert probability to ratio. Note: rand() generates int. + int32_t short_seq_ratio = 0; + if (short_seq_prob > 0) { + short_seq_ratio = static_cast(round(1.0 / short_seq_prob)); + } + + if (verbose) { + const auto sent_start_index = docs[0]; + const auto sent_end_index = docs[docs_.shape(0) - 1]; + const auto num_sentences = sent_end_index - sent_start_index; + cout << " using:" << endl << std::flush; + cout << " number of documents: " << docs_.shape(0) - 1 + << endl + << std::flush; + cout << " sentences range: [" << sent_start_index << ", " + << sent_end_index << ")" << endl + << std::flush; + cout << " total number of sentences: " << num_sentences << endl + << std::flush; + cout << " number of epochs: " << num_epochs << endl + << std::flush; + cout << " maximum number of samples: " << max_num_samples << endl + << std::flush; + cout << " maximum sequence length: " << max_seq_length << endl + << std::flush; + cout << " short sequence probability: " << short_seq_prob << endl + << std::flush; + cout << " short sequence ration (1/prob): " << short_seq_ratio << endl + << std::flush; + cout << " seed: " << seed << endl + << std::flush; + } + + // Mapping and it's length (1D). + int64_t num_samples = -1; + DocIdx* maps = NULL; + + // Perform two iterations, in the first iteration get the size + // and allocate memory and in the second iteration populate the map. + bool second = false; + for (int32_t iteration = 0; iteration < 2; ++iteration) { + // Set the seed so both iterations produce the same results. + std::mt19937 rand32_gen(seed); + + // Set the flag on second iteration. + second = (iteration == 1); + + // Counters: + uint64_t empty_docs = 0; + uint64_t one_sent_docs = 0; + uint64_t long_sent_docs = 0; + + // Current map index. + uint64_t map_index = 0; + + // For each epoch: + for (int32_t epoch = 0; epoch < num_epochs; ++epoch) { + if (map_index >= max_num_samples) { + if (verbose && (!second)) { + cout << " reached " << max_num_samples << " samples after " + << epoch << " epochs ..." 
<< endl + << std::flush; + } + break; + } + // For each document: + for (int32_t doc = 0; doc < (docs.shape(0) - 1); ++doc) { + // Document sentences are in [sent_index_first, sent_index_last) + const auto sent_index_first = docs[doc]; + const auto sent_index_last = docs[doc + 1]; + + // At the begining of the document previous index is the + // start index. + auto prev_start_index = sent_index_first; + + // Remaining documents. + auto num_remain_sent = sent_index_last - sent_index_first; + + // Some bookkeeping + if ((epoch == 0) && (!second)) { + if (num_remain_sent == 0) { + ++empty_docs; + } + if (num_remain_sent == 1) { + ++one_sent_docs; + } + } + + // Detect documents with long sentences. + bool contains_long_sentence = false; + if (num_remain_sent > 1) { + for (auto sent_index = sent_index_first; sent_index < sent_index_last; + ++sent_index) { + if (sizes[sent_index] > LONG_SENTENCE_LEN) { + if ((epoch == 0) && (!second)) { + ++long_sent_docs; + } + contains_long_sentence = true; + break; + } + } + } + + // If we have more than two sentences. + if ((num_remain_sent >= min_num_sent) && (!contains_long_sentence)) { + // Set values. + auto seq_len = int32_t{0}; + auto num_sent = int32_t{0}; + auto target_seq_len = get_target_sample_len( + short_seq_ratio, max_seq_length, rand32_gen); + + // Loop through sentences. + for (auto sent_index = sent_index_first; sent_index < sent_index_last; + ++sent_index) { + // Add the size and number of sentences. + seq_len += sizes[sent_index]; + ++num_sent; + --num_remain_sent; + + // If we have reached the target length. + // and if not only one sentence is left in the document. + // and if we have at least two sentneces. + // and if we have reached end of the document. + if (((seq_len >= target_seq_len) && (num_remain_sent > 1) && + (num_sent >= min_num_sent)) || + (num_remain_sent == 0)) { + // Check for overflow. + if ((3 * map_index + 2) > std::numeric_limits::max()) { + cout << "number of samples exceeded maximum " + << "allowed by type int64: " + << std::numeric_limits::max() << endl; + throw std::overflow_error("Number of samples"); + } + + // Populate the map. + if (second) { + const auto map_index_0 = 3 * map_index; + maps[map_index_0] = static_cast(prev_start_index); + maps[map_index_0 + 1] = static_cast(sent_index + 1); + maps[map_index_0 + 2] = static_cast(target_seq_len); + } + + // Update indices / counters. + ++map_index; + prev_start_index = sent_index + 1; + target_seq_len = get_target_sample_len( + short_seq_ratio, max_seq_length, rand32_gen); + seq_len = 0; + num_sent = 0; + } + + } // for (auto sent_index=sent_index_first; ... + } // if (num_remain_sent > 1) { + } // for (int doc=0; doc < num_docs; ++doc) { + } // for (int epoch=0; epoch < num_epochs; ++epoch) { + + if (!second) { + if (verbose) { + cout << " number of empty documents: " << empty_docs << endl + << std::flush; + cout << " number of documents with one sentence: " << one_sent_docs + << endl + << std::flush; + cout << " number of documents with long sentences: " << long_sent_docs + << endl + << std::flush; + cout << " will create mapping for " << map_index << " samples" << endl + << std::flush; + } + assert(maps == NULL); + assert(num_samples < 0); + maps = new DocIdx[3 * map_index]; + num_samples = static_cast(map_index); + } + + } // for (int iteration=0; iteration < 2; ++iteration) { + + // Shuffle. + // We need a 64 bit random number generator as we might have more + // than 2 billion samples. 
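+  // Fisher-Yates shuffle: walk the map from the back and swap each
+  // (start, end, length) triple with a randomly chosen earlier entry.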
+ std::mt19937_64 rand64_gen(seed + 1); + for (auto i = (num_samples - 1); i > 0; --i) { + const auto j = static_cast(rand64_gen() % (i + 1)); + const auto i0 = 3 * i; + const auto j0 = 3 * j; + // Swap values. + swap(maps[i0], maps[j0]); + swap(maps[i0 + 1], maps[j0 + 1]); + swap(maps[i0 + 2], maps[j0 + 2]); + } + + // Method to deallocate memory. + py::capsule free_when_done(maps, [](void* mem_) { + DocIdx* mem = reinterpret_cast(mem_); + delete[] mem; + }); + + // Return the numpy array. + const auto byte_size = sizeof(DocIdx); + return py::array(std::vector{num_samples, 3}, // shape + {3 * byte_size, byte_size}, // C-style contiguous strides + maps, // the data pointer + free_when_done); // numpy array references +} + + +py::array build_mapping(const py::array_t& docs_, + const py::array_t& sizes_, + const int num_epochs, + const uint64_t max_num_samples, + const int max_seq_length, + const double short_seq_prob, + const int seed, + const bool verbose, + const int32_t min_num_sent) { + if (sizes_.size() > std::numeric_limits::max()) { + if (verbose) { + cout << " using uint64 for data mapping..." << endl << std::flush; + } + return build_mapping_impl(docs_, + sizes_, + num_epochs, + max_num_samples, + max_seq_length, + short_seq_prob, + seed, + verbose, + min_num_sent); + } else { + if (verbose) { + cout << " using uint32 for data mapping..." << endl << std::flush; + } + return build_mapping_impl(docs_, + sizes_, + num_epochs, + max_num_samples, + max_seq_length, + short_seq_prob, + seed, + verbose, + min_num_sent); + } +} + +template +py::array build_blocks_mapping_impl(const py::array_t& docs_, + const py::array_t& sizes_, + const py::array_t& titles_sizes_, + const int32_t num_epochs, + const uint64_t max_num_samples, + const int32_t max_seq_length, + const int32_t seed, + const bool verbose, + const bool use_one_sent_blocks) { + /* Build a mapping of (start-index, end-index, sequence-length) where + start and end index are the indices of the sentences in the sample + and sequence-length is the target sequence length. + */ + + // Consistency checks. + assert(num_epochs > 0); + assert(max_seq_length > 1); + assert(seed > 0); + + // Remove bound checks. + auto docs = docs_.unchecked<1>(); + auto sizes = sizes_.unchecked<1>(); + auto titles_sizes = titles_sizes_.unchecked<1>(); + + if (verbose) { + const auto sent_start_index = docs[0]; + const auto sent_end_index = docs[docs_.shape(0) - 1]; + const auto num_sentences = sent_end_index - sent_start_index; + cout << " using:" << endl << std::flush; + cout << " number of documents: " << docs_.shape(0) - 1 + << endl + << std::flush; + cout << " sentences range: [" << sent_start_index << ", " + << sent_end_index << ")" << endl + << std::flush; + cout << " total number of sentences: " << num_sentences << endl + << std::flush; + cout << " number of epochs: " << num_epochs << endl + << std::flush; + cout << " maximum number of samples: " << max_num_samples << endl + << std::flush; + cout << " maximum sequence length: " << max_seq_length << endl + << std::flush; + cout << " seed: " << seed << endl + << std::flush; + } + + // Mapping and its length (1D). + int64_t num_samples = -1; + DocIdx* maps = NULL; + + // Acceptable number of sentences per block. + int min_num_sent = 2; + if (use_one_sent_blocks) { + min_num_sent = 1; + } + + // Perform two iterations, in the first iteration get the size + // and allocate memory and in the second iteration populate the map. 
+ bool second = false; + for (int32_t iteration = 0; iteration < 2; ++iteration) { + // Set the flag on second iteration. + second = (iteration == 1); + + // Current map index. + uint64_t map_index = 0; + + uint64_t empty_docs = 0; + uint64_t one_sent_docs = 0; + uint64_t long_sent_docs = 0; + // For each epoch: + for (int32_t epoch = 0; epoch < num_epochs; ++epoch) { + // assign every block a unique id + int32_t block_id = 0; + + if (map_index >= max_num_samples) { + if (verbose && (!second)) { + cout << " reached " << max_num_samples << " samples after " + << epoch << " epochs ..." << endl + << std::flush; + } + break; + } + // For each document: + for (int32_t doc = 0; doc < (docs.shape(0) - 1); ++doc) { + // Document sentences are in [sent_index_first, sent_index_last) + const auto sent_index_first = docs[doc]; + const auto sent_index_last = docs[doc + 1]; + const auto target_seq_len = max_seq_length - titles_sizes[doc]; + + // At the begining of the document previous index is the + // start index. + auto prev_start_index = sent_index_first; + + // Remaining documents. + auto num_remain_sent = sent_index_last - sent_index_first; + + // Some bookkeeping + if ((epoch == 0) && (!second)) { + if (num_remain_sent == 0) { + ++empty_docs; + } + if (num_remain_sent == 1) { + ++one_sent_docs; + } + } + // Detect documents with long sentences. + bool contains_long_sentence = false; + if (num_remain_sent >= min_num_sent) { + for (auto sent_index = sent_index_first; sent_index < sent_index_last; + ++sent_index) { + if (sizes[sent_index] > LONG_SENTENCE_LEN) { + if ((epoch == 0) && (!second)) { + ++long_sent_docs; + } + contains_long_sentence = true; + break; + } + } + } + // If we have enough sentences and no long sentences. + if ((num_remain_sent >= min_num_sent) && (!contains_long_sentence)) { + // Set values. + auto seq_len = int32_t{0}; + auto num_sent = int32_t{0}; + + // Loop through sentences. + for (auto sent_index = sent_index_first; sent_index < sent_index_last; + ++sent_index) { + // Add the size and number of sentences. + seq_len += sizes[sent_index]; + ++num_sent; + --num_remain_sent; + + // If we have reached the target length. + // and there are an acceptable number of sentences left + // and if we have at least the minimum number of sentences. + // or if we have reached end of the document. + if (((seq_len >= target_seq_len) && + (num_remain_sent >= min_num_sent) && + (num_sent >= min_num_sent)) || + (num_remain_sent == 0)) { + // Populate the map. + if (second) { + const auto map_index_0 = 4 * map_index; + // Each sample has 4 items: the starting sentence index, ending + // sentence index, + // the index of the document from which the block comes (used + // for fetching titles) + // and the unique id of the block (used for creating block + // indexes) + + maps[map_index_0] = static_cast(prev_start_index); + maps[map_index_0 + 1] = static_cast(sent_index + 1); + maps[map_index_0 + 2] = static_cast(doc); + maps[map_index_0 + 3] = static_cast(block_id); + } + + // Update indices / counters. + ++map_index; + ++block_id; + prev_start_index = sent_index + 1; + seq_len = 0; + num_sent = 0; + } + } // for (auto sent_index=sent_index_first; ... 
+ } // if (num_remain_sent > 1) { + } // for (int doc=0; doc < num_docs; ++doc) { + } // for (int epoch=0; epoch < num_epochs; ++epoch) { + + if (!second) { + if (verbose) { + cout << " number of empty documents: " << empty_docs << endl + << std::flush; + cout << " number of documents with one sentence: " << one_sent_docs + << endl + << std::flush; + cout << " number of documents with long sentences: " << long_sent_docs + << endl + << std::flush; + cout << " will create mapping for " << map_index << " samples" << endl + << std::flush; + } + assert(maps == NULL); + assert(num_samples < 0); + maps = new DocIdx[4 * map_index]; + num_samples = static_cast(map_index); + } + + } // for (int iteration=0; iteration < 2; ++iteration) { + + // Shuffle. + // We need a 64 bit random number generator as we might have more + // than 2 billion samples. + std::mt19937_64 rand64_gen(seed + 1); + for (auto i = (num_samples - 1); i > 0; --i) { + const auto j = static_cast(rand64_gen() % (i + 1)); + const auto i0 = 4 * i; + const auto j0 = 4 * j; + // Swap values. + swap(maps[i0], maps[j0]); + swap(maps[i0 + 1], maps[j0 + 1]); + swap(maps[i0 + 2], maps[j0 + 2]); + swap(maps[i0 + 3], maps[j0 + 3]); + } + + // Method to deallocate memory. + py::capsule free_when_done(maps, [](void* mem_) { + DocIdx* mem = reinterpret_cast(mem_); + delete[] mem; + }); + + // Return the numpy array. + const auto byte_size = sizeof(DocIdx); + return py::array(std::vector{num_samples, 4}, // shape + {4 * byte_size, byte_size}, // C-style contiguous strides + maps, // the data pointer + free_when_done); // numpy array references +} + +py::array build_blocks_mapping(const py::array_t& docs_, + const py::array_t& sizes_, + const py::array_t& titles_sizes_, + const int num_epochs, + const uint64_t max_num_samples, + const int max_seq_length, + const int seed, + const bool verbose, + const bool use_one_sent_blocks) { + if (sizes_.size() > std::numeric_limits::max()) { + if (verbose) { + cout << " using uint64 for data mapping..." << endl << std::flush; + } + return build_blocks_mapping_impl(docs_, + sizes_, + titles_sizes_, + num_epochs, + max_num_samples, + max_seq_length, + seed, + verbose, + use_one_sent_blocks); + } else { + if (verbose) { + cout << " using uint32 for data mapping..." 
<< endl << std::flush; + } + return build_blocks_mapping_impl(docs_, + sizes_, + titles_sizes_, + num_epochs, + max_num_samples, + max_seq_length, + seed, + verbose, + use_one_sent_blocks); + } +} + +PYBIND11_MODULE(helpers, m) { + m.def("build_mapping", &build_mapping); + m.def("build_blocks_mapping", &build_blocks_mapping); + m.def("build_sample_idx", &build_sample_idx); + m.def("build_blending_indices", &build_blending_indices); +} diff --git a/application/neural_search/recall/domain_adaptive_pretraining/data_tools/process_data.py b/application/neural_search/recall/domain_adaptive_pretraining/data_tools/process_data.py new file mode 100644 index 000000000000..5c2647ddc516 --- /dev/null +++ b/application/neural_search/recall/domain_adaptive_pretraining/data_tools/process_data.py @@ -0,0 +1,30 @@ +import pandas as pd +from sklearn.model_selection import train_test_split + +def create_pretraining_data(): + file_name='data/wanfang_text.csv' + data=pd.read_csv(file_name,sep='\t') + data=data.drop_duplicates() + print(data.shape) + data.to_csv('data/pretrain_data.csv',sep='\t',index=False) + +def process_data(): + ouput=open('wanfangdata/wanfang_text.txt','w') + with open('data/pretrain_data.csv') as f: + for i,item in enumerate(f.readlines()): + if(i==0): + continue + arr=item.strip().split('\t') + # queryText + ouput.write(arr[-2]+'\n') + ouput.write('\n') + # title + ouput.write(arr[-1]+'\n') + # abstract + ouput.write(arr[-3]+'\n') + ouput.write('\n') + + +if __name__=="__main__": + create_pretraining_data() + process_data() diff --git a/application/neural_search/recall/domain_adaptive_pretraining/data_tools/trans_to_json.py b/application/neural_search/recall/domain_adaptive_pretraining/data_tools/trans_to_json.py new file mode 100644 index 000000000000..6edce0e5df31 --- /dev/null +++ b/application/neural_search/recall/domain_adaptive_pretraining/data_tools/trans_to_json.py @@ -0,0 +1,180 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import re +import argparse +import json +import multiprocessing +import sys +import time +import shutil +from functools import partial + +import numpy as np +from tqdm import tqdm + + +def get_args(): + parser = argparse.ArgumentParser() + parser.add_argument( + '--input_path', + type=str, + required=True, + help='Path to you raw files. Folder or file path.') + parser.add_argument( + '--output_path', + type=str, + required=True, + help='Path to save the output json files.') + parser.add_argument( + '--json_key', + type=str, + default='text', + help='The content key of json file.') + parser.add_argument( + '--doc_spliter', + type=str, + default='', + help="Spliter between documents. We will strip the line, if you use blank line to split doc, leave it blank." 
+ ) + parser.add_argument( + '--min_doc_length', + type=int, + default=10, + help="Minimal char of a documment.") + parser.add_argument( + '--workers', + type=int, + default=1, + help='Number of worker processes to launch') + parser.add_argument( + '--log_interval', + type=int, + default=1, + help='Interval between progress updates.') + parser.add_argument( + '--no-merge', action='store_true', help='Don\'t merge the file.') + parser.add_argument( + '--no-shuffle', action='store_true', help='Don\'t shuffle the file.') + args = parser.parse_args() + return args + + +def raw_text_to_json(path, doc_spliter="", json_key="text", min_doc_length=10): + path = os.path.abspath(path) + if not os.path.exists(path): + print("No found file %s" % path) + return 0, None + + out_filepath = path + ".jsonl" + fout = open(out_filepath, "w", encoding="utf-8") + len_files = 0 + with open(path, "r") as f: + doc = "" + line = f.readline() + while line: + len_files += len(line) + if line.strip() == doc_spliter: + if len(doc) > min_doc_length: + fout.write( + json.dumps( + { + json_key: doc + }, ensure_ascii=False) + "\n") + doc = "" + else: + doc += line + line = f.readline() + + if len(doc) > min_doc_length: + fout.write(json.dumps({json_key: doc}, ensure_ascii=False) + "\n") + doc = "" + + return len_files, out_filepath + + +def merge_file(file_paths, output_path): + if not output_path.endswith(".jsonl"): + output_path = output_path + ".jsonl" + print("Merging files into %s" % output_path) + with open(output_path, 'wb') as wfd: + for f in file_paths: + if f is not None and os.path.exists(f): + with open(f, 'rb') as fd: + shutil.copyfileobj(fd, wfd) + os.remove(f) + print("File save in %s" % output_path) + return output_path + + +def shuffle_file(output_path): + print("Shuffling the jsonl file...") + if os.path.exists(output_path): + os.system("shuf %s -o %s" % (output_path, output_path)) + print("File shuffled!!!") + else: + raise ValueError("File not found: %s" % output_path) + + +def main(): + args = get_args() + startup_start = time.time() + + file_paths = [] + if os.path.isfile(args.input_path): + file_paths.append(args.input_path) + else: + for root, _, fs in os.walk(args.input_path): + for f in fs: + file_paths.append(os.path.join(root, f)) + + pool = multiprocessing.Pool(args.workers) + + startup_end = time.time() + proc_start = time.time() + total_bytes_processed = 0 + print("Time to startup:", startup_end - startup_start) + + trans_json = partial( + raw_text_to_json, + doc_spliter=args.doc_spliter, + json_key=args.json_key, + min_doc_length=args.min_doc_length) + encoded_files = pool.imap(trans_json, file_paths, 1) + + out_paths = [] + for i, (bytes_processed, out_path) in enumerate(encoded_files, start=1): + total_bytes_processed += bytes_processed + out_paths.append(out_path) + master_start = time.time() + + if i % args.log_interval == 0: + current = time.time() + elapsed = current - proc_start + mbs = total_bytes_processed / elapsed / 1024 / 1024 + print( + f"Processed {i} files", + f"({i/elapsed} files/s, {mbs} MB/s).", + file=sys.stderr) + + if not args.no_merge: + output_path = merge_file(out_paths, args.output_path) + if not args.no_shuffle: + shuffle_file(output_path) + + +if __name__ == "__main__": + main() + #profile.run("main()", "testprof") diff --git a/application/neural_search/recall/domain_adaptive_pretraining/ernie_static_to_dynamic.py b/application/neural_search/recall/domain_adaptive_pretraining/ernie_static_to_dynamic.py new file mode 100644 index 000000000000..4346778c3de6 --- 
/dev/null +++ b/application/neural_search/recall/domain_adaptive_pretraining/ernie_static_to_dynamic.py @@ -0,0 +1,41 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from paddlenlp.utils.tools import static_params_to_dygraph +import paddle +from paddlenlp.transformers import ErnieModel, ErnieForPretraining, ErniePretrainingCriterion, ErnieTokenizer +import os +import paddle +import paddle.static as static +import paddle.nn as nn + + +def load_ernie_model(static_model_path): + + model=ErnieModel.from_pretrained('ernie-1.0') + program_state = static.load_program_state(static_model_path) + ret_dict=static_params_to_dygraph(model,program_state) + + print('转换前的参数:') + print(model.embeddings.word_embeddings.weight ) + model.load_dict(ret_dict) + print('转换后的参数:') + print(model.embeddings.word_embeddings.weight ) + model.save_pretrained("./ernie_checkpoint") + + + +if __name__ == "__main__": + static_model_path="./output/ernie-1.0-dp8-gb1024/model_last/static_vars" + load_ernie_model(static_model_path) \ No newline at end of file diff --git a/application/neural_search/recall/domain_adaptive_pretraining/run_pretrain_static.py b/application/neural_search/recall/domain_adaptive_pretraining/run_pretrain_static.py new file mode 100644 index 000000000000..2c59149a97a6 --- /dev/null +++ b/application/neural_search/recall/domain_adaptive_pretraining/run_pretrain_static.py @@ -0,0 +1,676 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +Pretrain ERNIE in static graph mode. 
+""" +import argparse +import math +import os +import random +import time +import yaml +import shutil +import collections + +import numpy as np +import paddle +import paddle.distributed.fleet as fleet +from paddle.distributed.fleet.meta_optimizers.sharding.utils import save_persistables +from paddle.io import DataLoader, Dataset +from paddlenlp.utils.batch_sampler import DistributedBatchSampler +from paddlenlp.data import Stack, Tuple, Pad +from paddlenlp.utils.log import logger + +from paddlenlp.transformers import ErnieModel, ErnieForPretraining, ErniePretrainingCriterion, ErnieTokenizer + +from paddlenlp.transformers import CosineAnnealingWithWarmupDecay, LinearDecayWithWarmup + +from paddlenlp.ops import guard, Topology, get_rng_state_tracker +from paddlenlp.utils.log import logger +import paddlenlp.ops as ops +from visualdl import LogWriter + +from args import parse_args +import sys +sys.path.insert(0, "../") +from data_tools.dataset_utils import build_train_valid_test_datasets + +MODEL_CLASSES = { + "ernie": (ErnieModel, ErnieForPretraining, ErniePretrainingCriterion, + ErnieTokenizer), +} + + +def create_pretrained_dataset( + args, + data_file, + tokenizer, + data_world_size, + data_world_rank, + max_seq_len, + places, + data_holders, + current_step=0, ): + + train_valid_test_num_samples = [ + args.global_batch_size * args.max_steps, args.micro_batch_size * + (args.max_steps // args.eval_freq + 1) * args.eval_iters * + data_world_size, args.micro_batch_size * args.test_iters + ] + train_ds, valid_ds, test_ds = build_train_valid_test_datasets( + data_prefix=data_file, + args=args, + tokenizer=tokenizer, + splits_string=args.split, + train_valid_test_num_samples=train_valid_test_num_samples, + max_seq_length=args.max_seq_len, + masked_lm_prob=args.masked_lm_prob, + short_seq_prob=args.short_seq_prob, + seed=args.seed, + skip_warmup=True, + binary_head=True, + max_seq_length_dec=None, + dataset_type='ernie') + # 测试集没有 + # test_ds=valid_ds + def _collate_data(data, stack_fn=Stack()): + num_fields = len(data[0]) + out = [None] * num_fields + # 0. input_ids, + # 1. segment_ids, + # 2. input_mask, + # 3. masked_lm_positions, + # 4. masked_lm_labels, + # 5. 
next_sentence_labels + for i in (0, 1, 2, 5): + out[i] = stack_fn([x[i] for x in data]) + out[5] = out[5].reshape([-1, 1]) + batch_size, seq_length = out[0].shape + size = num_mask = sum(len(x[3]) for x in data) + # masked_lm_positions + # Organize as a 1D tensor for gather or use gather_nd + if size % 8 != 0: + size += 8 - (size % 8) + out[3] = np.full(size, 0, dtype=np.int32) + # masked_lm_labels + out[4] = np.full([size, 1], -1, dtype=np.int64) + mask_token_num = 0 + for i, x in enumerate(data): + for j, pos in enumerate(x[3]): + out[3][mask_token_num] = i * seq_length + pos + out[4][mask_token_num] = x[4][j] + mask_token_num += 1 + + return out + + def loader(dataset, consumed_samples=0): + # print(dataset[0]) + batch_sampler = DistributedBatchSampler( + dataset, + batch_size=args.micro_batch_size, + num_replicas=data_world_size, + rank=data_world_rank, + shuffle=False, + drop_last=True, + consumed_samples=consumed_samples) + data_loader = paddle.io.DataLoader( + dataset=dataset, + places=places, + feed_list=data_holders, + batch_sampler=batch_sampler, + num_workers=args.num_workers, + worker_init_fn=None, + collate_fn=_collate_data, + return_list=False) + return data_loader + + train_dl = loader(train_ds, args.global_batch_size * current_step) + valid_dl = loader(valid_ds, args.micro_batch_size * ( + (current_step + 1) // args.eval_freq) * args.eval_iters * + data_world_size) + test_dl = loader(test_ds, 0) + + return train_dl, valid_dl, test_dl + + +def create_data_holder(args=None): + input_ids = paddle.static.data( + name="input_ids", shape=[-1, -1], dtype="int64") + segment_ids = paddle.static.data( + name="segment_ids", shape=[-1, -1], dtype="int64") + input_mask = paddle.static.data( + name="input_mask", shape=[-1, 1, 1, -1], dtype="float32") + masked_lm_positions = paddle.static.data( + name="masked_lm_positions", shape=[-1], dtype="int32") + masked_lm_labels = paddle.static.data( + name="masked_lm_labels", shape=[-1, 1], dtype="int64") + next_sentence_labels = paddle.static.data( + name="next_sentence_labels", shape=[-1, 1], dtype="int64") + + return [ + input_ids, segment_ids, input_mask, masked_lm_positions, + masked_lm_labels, next_sentence_labels + ] + + +def dist_optimizer(args, topo): + default_global_batch_size = topo.data_info.size * args.micro_batch_size + if args.global_batch_size is None: + args.global_batch_size = default_global_batch_size + + bsz_per_dp = args.global_batch_size // topo.data_info.size + micro_batch_size = args.micro_batch_size + assert args.global_batch_size % micro_batch_size == 0, \ + "cannot do gradient accumulate, global_batch_size: {} micro_batch_size: {}".format( + args.global_batch_size, micro_batch_size) + accumulate_steps = bsz_per_dp // micro_batch_size + + exec_strategy = paddle.fluid.ExecutionStrategy() + exec_strategy.num_threads = 1 + exec_strategy.num_iteration_per_drop_scope = 10000 + + build_strategy = paddle.static.BuildStrategy() + #build_strategy.enable_sequential_execution = True # for profile + build_strategy.fuse_broadcast_ops = True + build_strategy.enable_inplace = True + build_strategy.enable_addto = args.enable_addto + + dist_strategy = fleet.DistributedStrategy() + dist_strategy.execution_strategy = exec_strategy + dist_strategy.build_strategy = build_strategy + dist_strategy.nccl_comm_num = 3 + dist_strategy.fuse_grad_size_in_MB = 16 + + dist_strategy.recompute = args.use_recompute + dist_strategy.pipeline = args.pp_degree > 1 + + if args.pp_degree <= 1 and args.sharding_degree <= 1 and accumulate_steps > 1: + 
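+        # Without pipeline or sharding parallelism, reach the global batch size
+        # by merging gradients across `accumulate_steps` micro-batches.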
dist_strategy.gradient_merge = True + dist_strategy.gradient_merge_configs = {'k_steps': accumulate_steps} + args.eval_iters *= accumulate_steps + args.test_iters *= accumulate_steps + + if args.use_amp: + dist_strategy.amp = True + dist_strategy.amp_configs = { + "custom_white_list": [ + 'softmax', + 'layer_norm', + 'gelu', + ], + "custom_black_list": ['c_softmax_with_cross_entropy'], + "init_loss_scaling": 32768, + "use_dynamic_loss_scaling": True, + } + if args.use_sharding: + dist_strategy.sharding = True + dist_strategy.sharding_configs = { + "segment_broadcast_MB": 32, + "sharding_degree": args.sharding_degree, + "mp_degree": args.mp_degree, + "pp_degree": args.pp_degree, + "dp_degree": args.dp_degree, + "gradient_merge_acc_step": accumulate_steps + if args.sharding_degree > 1 else 1, + "optimize_offload": False, + } + if args.pp_degree > 1: + dist_strategy.pipeline_configs = { + "schedule_mode": "1F1B", + "micro_micro_batch_size": micro_batch_size, + "accumulate_steps": accumulate_steps, + } + + args.accumulate_steps = accumulate_steps + return dist_strategy + + +def get_train_data_file(args): + files = [ + os.path.join(args.input_dir, f) for f in os.listdir(args.input_dir) + if (os.path.isfile(os.path.join(args.input_dir, f)) and "_idx.npz" in + str(f)) + ] + files = [x.replace("_idx.npz", "") for x in files] + return files + + +def run_evaluate(data_loader, + exe, + program, + iter_steps, + log_writer, + global_step, + args, + is_last, + eval_fetch, + task_name="valid"): + all_loss, all_lm_loss, all_sop_loss = [], [], [] + local_time = time.time() + + for eval_step, batch in enumerate(data_loader): + ret = exe.run(program, feed=batch, fetch_list=eval_fetch) + loss_return, lm_loss_return, sop_loss_return = ret + if is_last: + all_loss.append(float(loss_return[0])) + all_lm_loss.append(float(lm_loss_return[0])) + all_sop_loss.append(float(sop_loss_return[0])) + + if eval_step >= iter_steps - 1: + if not is_last: + break + average_loss = sum(all_loss) / len(all_loss) + average_lm_loss = sum(all_lm_loss) / len(all_lm_loss) + average_sop_loss = sum(all_sop_loss) / len(all_sop_loss) + logger.info( + "%s step %d, batch: %d, loss: %f, lm_loss: %.6f, sop_loss: %.6f, speed: %.0f tokens/s" + % (task_name, global_step, eval_step, average_loss, + average_lm_loss, average_sop_loss, + iter_steps * args.micro_batch_size * args.max_seq_len / + (time.time() - local_time))) + + log_writer.add_scalar(task_name + "_loss", average_loss, + global_step) + log_writer.add_scalar(task_name + "_lm_loss", average_lm_loss, + global_step) + log_writer.add_scalar(task_name + "_sop_loss", average_sop_loss, + global_step) + + break + + +def do_train(args): + # Initialize the paddle and paddle fleet execute environment + paddle.enable_static() + fleet.init(is_collective=True) + + # Create the random seed for the worker + random.seed(args.seed) + np.random.seed(args.seed) + paddle.seed(args.seed) + get_rng_state_tracker().add('global_seed', args.seed) + get_rng_state_tracker().add('local_seed', + args.seed + fleet.worker_index() + 2021) + + assert args.device in [ + "cpu", "gpu", "xpu" + ], "Invalid device! Available device should be cpu, gpu, or xpu." + place = paddle.set_device(args.device) + + worker_num = fleet.worker_num() + worker_index = fleet.worker_index() + assert args.dp_degree * args.sharding_degree * args.mp_degree * args.pp_degree == worker_num, \ + "The product of degree num should be equal to worker_num." 
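+    # Build the hybrid-parallel topology (data / sharding / model / pipeline) for this worker.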
+ + topo = Topology( + device_rank=worker_index, + world_size=worker_num, + dp_degree=args.dp_degree, + pp_degree=args.pp_degree, + sharding_degree=args.sharding_degree, + mp_degree=args.mp_degree) + + logger.info("The topo of hybrid parallelism:\n{}".format(topo)) + + dist_strategy = dist_optimizer(args, topo) + + # Create log write, train results show on last card of pipeline. + if topo.is_last: + log_writer_path = os.path.join( + args.output_dir, "train_log", + "{}_globalbsz_{}_amp_{}_recompute_{}_card_{}".format( + args.model_name_or_path, args.global_batch_size, args.use_amp, + args.use_recompute, worker_index).lower()) + # if os.path.exists(log_writer_path): + # shutil.rmtree(log_writer_path) + log_writer = LogWriter(log_writer_path) + + # Define the input data in the static mode + base_class, model_class, criterion_class, tokenizer_class = MODEL_CLASSES[ + args.model_type] + pretrained_models_list = list( + model_class.pretrained_init_configuration.keys()) + + # load config in checkpoint + global_step = 0 + consumed_samples = 0 + checkpoint_dir = os.path.join(args.output_dir, "model_last") + if os.path.exists(checkpoint_dir): + if os.path.isfile(os.path.join(checkpoint_dir, "./config.yml")): + with open(os.path.join(checkpoint_dir, "./config.yml"), "r") as f: + step_config = yaml.load(f, Loader=yaml.FullLoader) + assert step_config[ + "global_batch_size"] == args.global_batch_size, "Please ensure checkpoint global batch size is the same. Folder: {}".format( + checkpoint_dir) + consumed_samples = step_config["consumed_samples"] + global_step = step_config["global_step"] + + data_file = get_train_data_file(args) + main_program = paddle.static.default_main_program() + startup_program = paddle.static.default_startup_program() + with paddle.static.program_guard(main_program, startup_program): + data_holders = create_data_holder(args) + # 0. input_ids, + # 1. segment_ids, + # 2. input_mask, + # 3. masked_lm_positions, + # 4. masked_lm_labels, + # 5. 
next_sentence_labels + + [ + input_ids, segment_ids, input_mask, masked_lm_positions, + masked_lm_labels, next_sentence_labels + ] = data_holders + + tokenizer = tokenizer_class.from_pretrained(args.model_name_or_path) + + train_data_loader, valid_data_loader, test_data_loader = create_pretrained_dataset( + args, + data_file, + tokenizer, + data_world_size=topo.data_info.size, + data_world_rank=topo.data_info.rank, + max_seq_len=args.max_seq_len, + places=paddle.static.cuda_places(), + data_holders=data_holders, + current_step=global_step) + fleet.init(is_collective=True) + + if args.model_name_or_path in pretrained_models_list: + model_config = model_class.pretrained_init_configuration[ + args.model_name_or_path] + if model_config["vocab_size"] % 8 != 0: + model_config["vocab_size"] += 8 - (model_config["vocab_size"] % + 8) + model_config["hidden_dropout_prob"] = args.hidden_dropout_prob + model_config[ + "attention_probs_dropout_prob"] = args.attention_probs_dropout_prob + model = model_class(base_class(**model_config)) + else: + model, _ = model_class.from_pretrained( + args.model_name_or_path, + hidden_dropout_prob=args.hidden_dropout_prob, + attention_probs_dropout_prob=args.attention_probs_dropout_prob, + ) + + # Create the model for the gpt pretrain + prediction_scores, seq_relationship_score = model( + input_ids=input_ids, + token_type_ids=segment_ids, + position_ids=None, + attention_mask=input_mask, + masked_positions=masked_lm_positions) + + criterion = criterion_class() + lm_loss, sop_loss = criterion(prediction_scores, seq_relationship_score, + masked_lm_labels, next_sentence_labels) + loss = lm_loss + sop_loss + + # Create the learning_rate sheduler and optimizer + if args.decay_steps is None: + args.decay_steps = args.max_steps + + # lr_scheduler = CosineAnnealingWithWarmupDecay( + # max_lr=args.max_lr, + # min_lr=args.min_lr, + # warmup_step=args.warmup_rate * args.max_steps, + # decay_step=args.decay_steps, last_epoch=global_step) + + lr_scheduler = LinearDecayWithWarmup( + args.max_lr, + args.max_steps, + args.warmup_rate, + last_epoch=global_step) + + clip = None + if args.grad_clip > 0: + clip = paddle.fluid.clip.GradientClipByGlobalNorm( + clip_norm=args.grad_clip) + + decay_param = [ + p.name for n, p in model.named_parameters() + if not any(nd in n for nd in ["bias", "norm"]) + ] + logger.info("Using paddle.optimizer.AdamW.") + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + beta1=args.adam_beta1, + beta2=args.adam_beta2, + epsilon=args.adam_epsilon, + grad_clip=clip, + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_param) + # alias + optimizer.apply_optimize = optimizer._apply_optimize + + # if args.use_recompute: + # dist_strategy.recompute = True + # dist_strategy.recompute_configs = { + # "checkpoints": model.bert.checkpoints + # } + + # Use the fleet api to compile the distributed optimizer + optimizer = fleet.distributed_optimizer( + optimizer, strategy=dist_strategy) + + optimizer.minimize(loss) + logger.info(f'final strategy: {fleet._final_strategy()}') + logger.info("The training meta optimizer is/are %s" % + fleet._get_applied_meta_list()) + + program_desc_dir = os.path.join(args.output_dir, "program_desc") + if not os.path.isdir(program_desc_dir): + os.mkdir(program_desc_dir) + + with open(program_desc_dir + "/main_program.txt.%d" % worker_index, + 'w') as f: + f.write(str(main_program)) + + with open(program_desc_dir + "/startup_program.txt.%d" % worker_index, + 'w') as f: + f.write(str(startup_program)) + 
+ # Define the Executor for running the static model + exe = paddle.static.Executor(place) + exe.run(startup_program) + + test_program = main_program.clone(for_test=True) + + if args.model_name_or_path not in pretrained_models_list: + logger.info("Try to load checkpoint from %s " % args.model_name_or_path) + dygrah_path = os.path.join(args.model_name_or_path, + "model_state.pdparams") + static_path = os.path.join(args.model_name_or_path, "static_vars") + + flag_loaded = False + if os.path.exists(static_path): + if args.mp_degree > 1: + logger.warning("MP should init with dygraph params") + else: + logger.info("Loading parameters from %s" % static_path) + paddle.static.load(main_program, static_path, exe) + flag_loaded = True + + if not flag_loaded and os.path.exists(dygrah_path): + if args.sharding_degree > 1: + logger.warning("Sharding should init with static vars") + else: + logger.info("Loading parameters from %s" % dygrah_path) + init_static_with_params( + model, + paddle.load( + dygrah_path, return_numpy=True), + topo, + main_program) + flag_loaded = True + + if not flag_loaded: + logger.error("No checkpoint load.") + + # load checkpoint vars + if os.path.exists(checkpoint_dir): + if os.path.isfile(os.path.join(checkpoint_dir, "./config.yml")): + paddle.static.load(main_program, + os.path.join(checkpoint_dir, "static_vars"), exe) + + fetch_vars = collections.OrderedDict() + fetch_vars["loss"] = loss + fetch_vars["lm_loss"] = lm_loss + fetch_vars["sop_loss"] = sop_loss + fetch_vars["learning_rate"] = main_program.global_block().vars[ + "learning_rate_0"] + + additional_vars = collections.OrderedDict() + if args.use_amp: + for key in ["loss_scaling", "num_good_steps", "num_bad_steps"]: + additional_vars[key] = main_program.global_block().vars[key + "_0"] + + tic_train = time.time() + while True: + fetchs = [] + fetchs_keys = [] + if topo.is_last: + fetchs = list(fetch_vars.values()) + list(additional_vars.values()) + fetchs_keys = list(fetch_vars.keys()) + list(additional_vars.keys()) + + # Bug fix, if not call valid_data_loader, the enumerate will call valid_data_loader + # many times. and start a new random dataloader. 
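+        # Materialize both generators once here so every evaluation resumes the
+        # same iterator instead of creating a new dataloader each time.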
+ valid_data_loader = valid_data_loader() + test_data_loader = test_data_loader() + + + for step, batch in enumerate(train_data_loader()): + ret = exe.run(main_program, + feed=batch, + fetch_list=fetchs, + use_program_cache=True) + # Skip for accumulate_steps in global step + if (step + 1) % args.accumulate_steps != 0: + continue + global_step += 1 + # In the new 2.0 api, must call this function to change the learning_rate + lr_scheduler.step() + + if global_step % args.logging_freq == 0: + if topo.is_last: + res = {} + for k, v in zip(fetchs_keys, ret): + res[k] = v[0] + + speed = args.logging_freq / (time.time() - tic_train) + common_loginfo = "global step %d, loss: %.9f, lm_loss: %.6f, sop_loss: %.6f, speed: %.2f steps/s, ips: %.2f seqs/s, learning rate: %.5e" % ( + global_step, res["loss"], res["lm_loss"], + res["sop_loss"], speed, speed * args.global_batch_size, + res["learning_rate"]) + additional_loginfo = ", ".join([ + "{}: {}".format(k, res[k]) + for k in additional_vars.keys() + ]) + if additional_loginfo: + common_loginfo += ", " + additional_loginfo + logger.info(common_loginfo) + for k, v in res.items(): + log_writer.add_scalar(k, v, global_step) + + tic_train = time.time() + + #if args.check_accuracy: + # if global_step >= args.max_steps: + # return + # else: + # continue + + if global_step % args.eval_freq == 0: + # TODO, check the input data of validation + eval_fetch = [] + if topo.is_last: + eval_fetch = [loss, lm_loss, sop_loss] + + run_evaluate(valid_data_loader, exe, test_program, + args.eval_iters, log_writer, global_step, args, + topo.is_last, eval_fetch, "valid") + tic_train = time.time() + + if global_step % args.save_steps == 0 or global_step >= args.max_steps: + output_dir = os.path.join(args.output_dir, + "model_%d" % global_step) + logger.debug("saving models to {}".format(output_dir)) + save_persistables(exe, + os.path.join(output_dir, "static_vars"), + main_program) + if global_step == args.save_steps: + model.init_config["init_args"][0].init_config.pop("topo", + None) + model.save_pretrained(output_dir) + tokenizer.save_pretrained(output_dir) + tic_train = time.time() + + if global_step % args.checkpoint_steps == 0: + output_dir = os.path.join(args.output_dir, "model_last") + if worker_index == 0: + if not os.path.exists(output_dir): + os.mkdir(output_dir) + output_dir_bak = os.path.join(args.output_dir, + "model_last_bak") + if os.path.exists(output_dir): + if os.path.exists(output_dir_bak): + shutil.rmtree(output_dir_bak) + shutil.move(output_dir, output_dir_bak) + os.mkdir(output_dir) + + step_config = { + "model_name": args.model_name_or_path, + "global_step": global_step, + "global_batch_size": args.global_batch_size, + "consumed_samples": + global_step * args.global_batch_size, + } + + with open(os.path.join(output_dir, "config.yml"), "w") as f: + yaml.dump( + step_config, + f, + encoding='utf-8', + allow_unicode=True) + + fleet.barrier_worker() + + logger.debug("saving models to {}".format(output_dir)) + if args.sharding_degree <= 1: + # Save on the first worker by default. 
+ if worker_index == 0: + paddle.static.save(main_program, + os.path.join(output_dir, + "static_vars")) + else: + # Use save_persistables in sharding, but more slower + save_persistables(exe, + os.path.join(output_dir, "static_vars"), + main_program) + + if global_step >= args.max_steps: + eval_fetch = [] + if topo.is_last: + eval_fetch = [loss, lm_loss, sop_loss] + + run_evaluate(test_data_loader, exe, test_program, + args.test_iters, log_writer, global_step, args, + topo.is_last, eval_fetch, "test") + del train_data_loader + return + + +if __name__ == "__main__": + config = parse_args(MODEL_CLASSES) + do_train(config) diff --git a/application/neural_search/recall/domain_adaptive_pretraining/scripts/run_pretrain_static.sh b/application/neural_search/recall/domain_adaptive_pretraining/scripts/run_pretrain_static.sh new file mode 100644 index 000000000000..eed21f18443b --- /dev/null +++ b/application/neural_search/recall/domain_adaptive_pretraining/scripts/run_pretrain_static.sh @@ -0,0 +1,36 @@ +unset CUDA_VISIBLE_DEVICES + +task_name="ernie-1.0-dp8-gb1024" +rm -rf output/$task_name/log + +PYTHONPATH=../../../ python -u -m paddle.distributed.launch \ + --gpus "0,1,2,3" \ + --log_dir "output/$task_name/log" \ + run_pretrain_static.py \ + --model_type "ernie" \ + --model_name_or_path "ernie-1.0" \ + --input_dir "./data" \ + --output_dir "output/$task_name" \ + --max_seq_len 512 \ + --micro_batch_size 32 \ + --global_batch_size 128 \ + --sharding_degree 1\ + --dp_degree 4 \ + --use_sharding false \ + --use_amp true \ + --use_recompute false \ + --max_lr 0.0001 \ + --min_lr 0.00001 \ + --max_steps 200000 \ + --save_steps 100000 \ + --checkpoint_steps 5000 \ + --decay_steps 1980000 \ + --weight_decay 0.01\ + --warmup_rate 0.01 \ + --grad_clip 1.0 \ + --num_workers 2 \ + --logging_freq 20\ + --eval_freq 1000 \ + --device "gpu" + +# NOTE: please set use_sharding=True for sharding_degree > 1 diff --git a/application/neural_search/recall/in_batch_negative/README.md b/application/neural_search/recall/in_batch_negative/README.md new file mode 100644 index 000000000000..0579059a8bc1 --- /dev/null +++ b/application/neural_search/recall/in_batch_negative/README.md @@ -0,0 +1,444 @@ +# In-batch Negatives + + **目录** + +* [背景介绍](#背景介绍) +* [In-batch Negatives](#In-batchNegatives) + * [1. 技术方案和评估指标](#技术方案) + * [2. 环境依赖](#环境依赖) + * [3. 代码结构](#代码结构) + * [4. 数据准备](#数据准备) + * [5. 模型训练](#模型训练) + * [6. 评估](#开始评估) + * [7. 预测](#预测) + * [8. 部署](#部署) + + + +# 背景介绍 + +语义索引(可通俗理解为向量索引)技术是搜索引擎、推荐系统、广告系统在召回阶段的核心技术之一。语义索引模型的目标是:给定输入文本,模型可以从海量候选召回库中**快速、准确**地召回一批语义相关文本。语义索引模型的效果直接决定了语义相关的物料能否被成功召回进入系统参与上层排序,从基础层面影响整个系统的效果。 + +在召回阶段,最常见的方式是通过双塔模型,学习Document(简写为Doc)的向量表示,对Doc端建立索引,用ANN召回。我们在这种方式的基础上,引入语义索引策略 [In-batch Negatives](https://arxiv.org/abs/2004.04906),以如下Batch size=4的训练数据为例: + + +``` +我手机丢了,我想换个手机 我想买个新手机,求推荐 +求秋色之空漫画全集 求秋色之空全集漫画 +学日语软件手机上的 手机学日语的软件 +侠盗飞车罪恶都市怎样改车 侠盗飞车罪恶都市怎么改车 +``` + +In-batch Negatives 策略的训练数据为语义相似的 Pair 对,策略核心是在 1 个 Batch 内同时基于 N 个负例进行梯度更新,将Batch 内除自身之外其它所有 Source Text 的相似文本 Target Text 作为负例,例如: 上例中“我手机丢了,我想换个手机” 有 1 个正例(”我想买个新手机,求推荐“),3 个负例(1.求秋色之空全集漫画,2.手机学日语的软件,3.侠盗飞车罪恶都市怎么改车)。 + + + + +# In-batch Negatives + + + +## 1. 
技术方案和评估指标 + +### 技术方案 + +双塔模型,采用ERNIE1.0热启,在召回训练阶段引入In-batch Negatives 策略,使用hnswlib建立索引库,进行召回测试。 + + +### 评估指标 + +采用 Recall@1,Recall@5 ,Recall@10 ,Recall@20 和 Recall@50 指标来评估语义索引模型的召回效果。 + +Recall@K召回率是指预测的前topK(top-k是指从最后的按得分排序的召回列表中返回前k个结果)结果中检索出的相关结果数和库中所有的相关结果数的比率,衡量的是检索系统的查全率。 + +**效果评估** + +| 模型 | Recall@1 | Recall@5 |Recall@10 |Recall@20 |Recall@50 |策略简要说明| +| ------------ | ------------ | ------------ |--------- |--------- |--------- |--------- | +| In-batch Negatives | 51.301 | 65.309| 69.878| 73.996|78.881| Inbatch-negative有监督训练| + + + + + +## 2. 环境依赖 + +推荐使用GPU进行训练,在预测阶段使用CPU或者GPU均可。 + +**环境依赖** +* python >= 3.6 +* paddlepaddle >= 2.1.3 +* paddlenlp >= 2.2 +* [hnswlib](https://github.com/nmslib/hnswlib) >= 0.5.2 +* visualdl >= 2.2.2 + + + +## 3. 代码结构 + +``` +|—— data.py # 数据读取、数据转换等预处理逻辑 +|—— base_model.py # 语义索引模型基类 +|—— train_batch_neg.py # In-batch Negatives 策略的训练主脚本 +|—— batch_negative + |—— model.py # In-batch Negatives 策略核心网络结构 +|—— ann_util.py # Ann 建索引库相关函数 + + +|—— recall.py # 基于训练好的语义索引模型,从召回库中召回给定文本的相似文本 +|—— evaluate.py # 根据召回结果和评估集计算评估指标 +|—— predict.py # 给定输入文件,计算文本 pair 的相似度 +|—— export_model.py # 动态图转换成静态图 +|—— scripts + |—— export_model.sh # 动态图转换成静态图脚本 + |—— predict.sh # 预测bash版本 + |—— evaluate.sh # 评估bash版本 + |—— run_build_index.sh # 构建索引bash版本 + |—— train_batch_neg.sh # 训练bash版本 +|—— deploy + |—— python + |—— predict.py # PaddleInference + |—— deploy.sh # Paddle Inference部署脚本 +|—— inference.py # 动态图抽取向量 + +``` + + + +## 4. 数据准备 + +### 数据集说明 + +我们基于某文献检索平台数据,构造面向语义索引的训练集、测试集、召回库。 + +**训练集** 和 **验证集** 格式一致,训练集4k条,测试集2w条,每行由一对语义相似的文本Pair构成,以tab符分割,第一列是检索query,第二列由相关文献标题(+关键词)构成。样例数据如下: + +``` +宁夏社区图书馆服务体系布局现状分析 宁夏社区图书馆服务体系布局现状分析社区图书馆,社区图书馆服务,社区图书馆服务体系 +人口老龄化对京津冀经济 京津冀人口老龄化对区域经济增长的影响京津冀,人口老龄化,区域经济增长,固定效应模型 +英语广告中的模糊语 模糊语在英语广告中的应用及其功能模糊语,英语广告,表现形式,语用功能 +甘氨酸二肽的合成 甘氨酸二肽合成中缩合剂的选择甘氨酸,缩合剂,二肽 +``` + +**召回库** 用于模拟业务线上的全量语料库,评估模型的召回效果,计算相应的Recall指标。召回库总共30万条样本,每行由一列构成,文献标题(+关键词),样例数据如下: +``` +陕西省贫困地区城乡青春期少女生长发育调查青春期,生长发育,贫困地区 +五丈岩水库溢洪道加固工程中的新材料应用碳纤维布,粘钢加固技术,超细水泥,灌浆技术 +木塑复合材料在儿童卫浴家具中的应用探索木塑复合材料,儿童,卫浴家具 +泡沫铝准静态轴向压缩有限元仿真泡沫铝,准静态,轴向压缩,力学特性 +``` + + +### 数据集下载 + + +- [literature_search_data](https://bj.bcebos.com/v1/paddlenlp/data/literature_search_data.zip) + +``` +├── milvus # milvus建库数据集 + ├── milvus_data.csv. # 构建召回库的数据 +├── recall # 召回(语义索引)数据集 + ├── corpus.csv # 用于测试的召回库 + ├── dev.csv # 召回验证集 + ├── test.csv # 召回测试集 + ├── train.csv # 召回训练集 + ├── train_unsupervised.csv # 无监督训练集 +├── sort # 排序数据集 + ├── test_pairwise.csv # 排序测试集 + ├── dev_pairwise.csv # 排序验证集 + └── train_pairwise.csv # 排序训练集 + +``` + + + +## 5. 模型训练 + +**语义索引训练模型下载链接:** + +以下模型结构参数为: `TrasformerLayer:12, Hidden:768, Heads:12, OutputEmbSize: 256` + +|Model|训练参数配置|硬件|MD5| +| ------------ | ------------ | ------------ |-----------| +|[batch_neg](https://bj.bcebos.com/v1/paddlenlp/models/inbatch_model.zip)|
margin:0.2 scale:30 epoch:3 lr:5E-5 bs:64 max_len:64 | 4卡 v100-16g
|f3e5c7d7b0b718c2530c5e1b136b2d74| + +### 训练环境说明 + + +- NVIDIA Driver Version: 440.64.00 +- Ubuntu 16.04.6 LTS (Docker) +- Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz + + +### 单机单卡训练/单机多卡训练 + +这里采用单机多卡方式进行训练,通过如下命令,指定 GPU 0,1,2,3 卡, 基于 In-batch Negatives 策略训练模型,数据量比较小,几分钟就可以完成。如果采用单机单卡训练,只需要把`--gpus`参数设置成单卡的卡号即可。 + +如果使用CPU进行训练,则需要吧`--gpus`参数去除,然后吧`device`设置成cpu即可,详细请参考train_batch_neg.sh文件的训练设置 + +然后运行下面的命令使用GPU训练,得到语义索引模型: + +``` +root_path=recall +python -u -m paddle.distributed.launch --gpus "0,1,2,3" \ + train_batch_neg.py \ + --device gpu \ + --save_dir ./checkpoints/${root_path} \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 3 \ + --output_emb_size 256 \ + --save_steps 10 \ + --max_seq_length 64 \ + --margin 0.2 \ + --train_set_file recall/train.csv + +``` + +参数含义说明 + +* `device`: 使用 cpu/gpu 进行训练 +* `batch_size`: 训练的batch size的大小 +* `learning_rate`: 训练的学习率的大小 +* `epochs`: 训练的epoch数 +* `save_dir`: 模型存储路径 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `save_steps`: 模型存储 checkpoint 的间隔 steps 个数 +* `max_seq_length`: 输入序列的最大长度 +* `margin`: 正样本相似度与负样本之间的目标 Gap +* `train_set_file`: 训练集文件 + + +也可以使用bash脚本: + +``` +sh scripts/train_batch_neg.sh +``` + + + + + +## 6. 评估 + +效果评估分为 4 个步骤: + +a. 获取Doc端Embedding + +基于语义索引模型抽取出Doc样本库的文本向量。 + +b. 采用hnswlib对Doc端Embedding建库 + +使用 ANN 引擎构建索引库(这里基于 [hnswlib](https://github.com/nmslib/hnswlib) 进行 ANN 索引) + +c. 获取Query的Embedding并查询相似结果 + +基于语义索引模型抽取出评估集 *Source Text* 的文本向量,在第 2 步中建立的索引库中进行 ANN 查询,召回 Top50 最相似的 *Target Text*, 产出评估集中 *Source Text* 的召回结果 `recall_result` 文件。 + +d. 评估 + +基于评估集 `same_semantic.tsv` 和召回结果 `recall_result` 计算评估指标 Recall@k,其中k取值1,5,10,20,50。 + +运行如下命令进行 ANN 建库、召回,产出召回结果数据 `recall_result` + +``` +root_dir="checkpoints/inbatch" +python -u -m paddle.distributed.launch --gpus "3" --log_dir "recall_log/" \ + recall.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --params_path "${root_dir}/model_40/model_state.pdparams" \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 64 \ + --output_emb_size 256\ + --max_seq_length 60 \ + --recall_num 50 \ + --similar_text_pair "recall/dev.csv" \ + --corpus_file "recall/corpus.csv" +``` +参数含义说明 +* `device`: 使用 cpu/gpu 进行训练 +* `recall_result_dir`: 召回结果存储目录 +* `recall_result_file`: 召回结果的文件名 +* `params_path`: 待评估模型的参数文件名 +* `hnsw_m`: hnsw 算法相关参数,保持默认即可 +* `hnsw_ef`: hnsw 算法相关参数,保持默认即可 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `recall_num`: 对 1 个文本召回的相似文本数量 +* `similar_text_pair`: 由相似文本对构成的评估集 semantic_similar_pair.tsv +* `corpus_file`: 召回库数据 corpus_file + +也可以使用下面的bash脚本: + +``` +sh scripts/run_build_index.sh +``` + +run_build_index.sh还包含cpu和gpu运行的脚本,默认是gpu的脚本 + +成功运行结束后,会在 `./recall_result_dir/` 目录下产出 `recall_result.txt` 文件 + +``` +热处理对尼龙6 及其与聚酰胺嵌段共聚物共混体系晶体熔融行为和结晶结构的影响 热处理对尼龙6及其与聚酰胺嵌段共聚物共混体系晶体熔融行为和结晶结构的影响尼龙6,聚酰胺嵌段共聚物,芳香聚酰胺,热处理 0.9831992387771606 +热处理对尼龙6 及其与聚酰胺嵌段共聚物共混体系晶体熔融行为和结晶结构的影响 热处理方法对高强高模聚乙烯醇纤维性能的影响聚乙烯醇纤维,热处理,性能,热拉伸,热定型 0.8438636660575867 +热处理对尼龙6 及其与聚酰胺嵌段共聚物共混体系晶体熔融行为和结晶结构的影响 制备工艺对PVC/ABS合金力学性能和维卡软化温度的影响PVC,ABS,正交试验,力学性能,维卡软化温度 0.8130228519439697 +..... 
+``` + + +接下来,运行如下命令进行效果评估,产出Recall@1, Recall@5, Recall@10, Recall@20 和 Recall@50 指标: +``` +python -u evaluate.py \ + --similar_text_pair "recall/test.csv" \ + --recall_result_file "./recall_result_dir/recall_result.txt" \ + --recall_num 50 +``` +也可以使用下面的bash脚本: + +``` +sh scripts/evaluate.sh +``` + +参数含义说明 +* `similar_text_pair`: 由相似文本对构成的评估集 semantic_similar_pair.tsv +* `recall_result_file`: 针对评估集中第一列文本 *Source Text* 的召回结果 +* `recall_num`: 对 1 个文本召回的相似文本数量 + +成功运行结束后,会输出如下评估指标: + +``` +recall@1=51.261 +recall@5=65.279 +recall@10=69.848 +recall@20=73.971 +recall@50=78.84 +``` + + + +## 7. 预测 + +我们可以基于语义索引模型预测文本的语义向量或者计算文本 Pair 的语义相似度。 + +### 7.1 功能一:抽取文本的语义向量 + +修改 inference.py 文件里面输入文本 id2corpus 和模型路径 params_path : + +``` +params_path='checkpoints/inbatch/model_40/model_state.pdparams' +id2corpus={0:'国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据'} +``` +然后运行: +``` +python inference.py +``` +预测结果为256维的向量: + +``` +[1, 256] +[[ 0.07766181 -0.13780491 0.03388524 -0.14910668 -0.0334941 0.06780092 + 0.0104043 0.03168401 0.02605671 0.02088691 0.05520441 -0.0852212 + ..... +``` + +### 7.2 功能二:计算文本 Pair 的语义相似度 + + +### 准备预测数据 + +待预测数据为 tab 分隔的 csv 文件,每一行为 1 个文本 Pair,部分示例如下: +``` +试论我国海岸带经济开发的问题与前景 试论我国海岸带经济开发的问题与前景海岸带,经济开发,问题,前景 +外语阅读焦虑与英语成绩及性别的关系 外语阅读焦虑与英语成绩及性别的关系外语阅读焦虑,外语课堂焦虑,英语成绩,性别 +数字图书馆 智能化图书馆 +网络健康可信性研究 网络成瘾少年 +``` + +### 开始预测 + +以上述 demo 数据为例,运行如下命令基于我们开源的 [In-batch Negatives](https://arxiv.org/abs/2004.04906) 策略语义索引模型开始计算文本 Pair 的语义相似度: +``` +root_dir="checkpoints/inbatch" + +python -u -m paddle.distributed.launch --gpus "3" \ + predict.py \ + --device gpu \ + --params_path "${root_dir}/model_40/model_state.pdparams" \ + --output_emb_size 256 \ + --batch_size 128 \ + --max_seq_length 64 \ + --text_pair_file "recall/test.csv" +``` + +参数含义说明 +* `device`: 使用 cpu/gpu 进行训练 +* `params_path`: 预训练模型的参数文件名 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `text_pair_file`: 由文本 Pair 构成的待预测数据集 + +也可以运行下面的bash脚本: + +``` +sh scripts/predict.sh +``` +predict.sh文件包含了cpu和gpu运行的脚本,默认是gpu运行的脚本 + +产出如下结果 +``` +0.9717282652854919 +0.9371012449264526 +0.7968897223472595 +0.30377304553985596 +``` + + + +## 8. 部署 + +### 动转静导出 + +首先把动态图模型转换为静态图: + +``` +python export_model.py --params_path checkpoints/inbatch/model_40/model_state.pdparams --output_path=./output +``` +也可以运行下面的bash脚本: + +``` +sh scripts/export_model.sh +``` + +### Paddle Inference预测 + +修改id2corpus的样本: + +``` +id2corpus={0:'国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据'} + +``` + +然后使用PaddleInference + +``` +python deploy/python/predict.py --model_dir=./output +``` +也可以运行下面的bash脚本: + +``` +sh deploy.sh +``` +输出结果如下: + +``` +(1, 256) +[[-0.0394925 -0.04474756 -0.065534 0.00939134 0.04359895 0.14659195 + -0.0091779 -0.07303623 0.09413272 -0.01255222 -0.08685658 0.02762237 + 0.10138468 0.00962821 0.10888419 0.04553023 0.05898942 0.00694253 + .... +``` + +## Reference + +[1] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, Wen-tau Yih, Dense Passage Retrieval for Open-Domain Question Answering, Preprint 2020. diff --git a/application/neural_search/recall/in_batch_negative/ann_util.py b/application/neural_search/recall/in_batch_negative/ann_util.py new file mode 100644 index 000000000000..707e58e752d7 --- /dev/null +++ b/application/neural_search/recall/in_batch_negative/ann_util.py @@ -0,0 +1,60 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# coding=UTF-8 + +import numpy as np +import hnswlib +from paddlenlp.utils.log import logger + + +def build_index(args, data_loader, model): + + index = hnswlib.Index(space='ip', dim=args.output_emb_size) + + # Initializing index + # max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded + # during insertion of an element. + # The capacity can be increased by saving/loading the index, see below. + # + # ef_construction - controls index search speed/build speed tradeoff + # + # M - is tightly connected with internal dimensionality of the data. Strongly affects memory consumption (~M) + # Higher M leads to higher accuracy/run_time at fixed ef/efConstruction + index.init_index( + max_elements=args.hnsw_max_elements, + ef_construction=args.hnsw_ef, + M=args.hnsw_m) + + # Controlling the recall by setting ef: + # higher ef leads to better accuracy, but slower search + index.set_ef(args.hnsw_ef) + + # Set number of threads used during batch search/construction + # By default using all available cores + index.set_num_threads(16) + + logger.info("start build index..........") + + all_embeddings = [] + + for text_embeddings in model.get_semantic_embedding(data_loader): + all_embeddings.append(text_embeddings.numpy()) + + all_embeddings = np.concatenate(all_embeddings, axis=0) + index.add_items(all_embeddings) + + logger.info("Total index number:{}".format(index.get_current_count())) + + return index diff --git a/application/neural_search/recall/in_batch_negative/base_model.py b/application/neural_search/recall/in_batch_negative/base_model.py new file mode 100644 index 000000000000..3b4c08641f56 --- /dev/null +++ b/application/neural_search/recall/in_batch_negative/base_model.py @@ -0,0 +1,181 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
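+
+# SemanticIndexBase is the shared dygraph base class for the recall models in this
+# directory: it takes the pooled [CLS] output of the pretrained ERNIE encoder,
+# optionally projects it from 768 dims down to `output_emb_size` with a linear layer,
+# applies dropout and L2-normalizes the result, so the inner product of two pooled
+# embeddings equals their cosine similarity.
+# SemanticIndexBaseStatic is the exportable variant: it additionally implements
+# forward() to return the pooled text embedding directly, which is what
+# export_model.py traces and saves as a static graph for Paddle Inference.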
+ +import abc +import sys + +import numpy as np + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class SemanticIndexBase(nn.Layer): + def __init__(self, pretrained_model, dropout=None, output_emb_size=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is not None, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off beteween + # recall performance and efficiency + + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr( + initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear( + 768, output_emb_size, weight_attr=weight_attr) + + @paddle.jit.to_static(input_spec=[paddle.static.InputSpec(shape=[None, None], dtype='int64'),paddle.static.InputSpec(shape=[None, None], dtype='int64')]) + def get_pooled_embedding(self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, + attention_mask) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + + text_embeddings = self.get_pooled_embedding( + input_ids, token_type_ids=token_type_ids) + + yield text_embeddings + + def cosine_sim(self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, + query_attention_mask) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, + title_attention_mask) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, + axis=-1) + return cosine_sim + + @abc.abstractmethod + def forward(self): + pass + + + +class SemanticIndexBaseStatic(nn.Layer): + def __init__(self, pretrained_model, dropout=None, output_emb_size=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is not None, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off beteween + # recall performance and efficiency + + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr( + initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear( + 768, output_emb_size, weight_attr=weight_attr) + + @paddle.jit.to_static(input_spec=[paddle.static.InputSpec(shape=[None, None], dtype='int64'),paddle.static.InputSpec(shape=[None, None], dtype='int64')]) + def get_pooled_embedding(self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, + attention_mask) + + if self.output_emb_size > 0: + 
cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + + text_embeddings = self.get_pooled_embedding( + input_ids, token_type_ids=token_type_ids) + + yield text_embeddings + + def cosine_sim(self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, + query_attention_mask) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, + title_attention_mask) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, + axis=-1) + return cosine_sim + + + def forward(self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, + attention_mask) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding \ No newline at end of file diff --git a/application/neural_search/recall/in_batch_negative/batch_negative/model.py b/application/neural_search/recall/in_batch_negative/batch_negative/model.py new file mode 100644 index 000000000000..9d883e222561 --- /dev/null +++ b/application/neural_search/recall/in_batch_negative/batch_negative/model.py @@ -0,0 +1,74 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
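+
+# SemanticIndexBatchNeg implements the In-batch Negatives loss on top of
+# SemanticIndexBase. For a batch of B (query, title) pairs it builds the B x B
+# cosine-similarity matrix between the L2-normalized query and title embeddings;
+# the diagonal holds the positive pairs and every other title in the batch serves
+# as a negative. The margin is subtracted from the diagonal, the matrix is scaled
+# to ease convergence, and softmax cross-entropy with labels [0, 1, ..., B-1] is
+# the training loss (e.g. with batch_size 4 the labels are [0, 1, 2, 3], so row i
+# must score its own title highest).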
+ +import sys + +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + +from base_model import SemanticIndexBase + +class SemanticIndexBatchNeg(SemanticIndexBase): + def __init__(self, + pretrained_model, + dropout=None, + margin=0.3, + scale=30, + output_emb_size=None): + super().__init__(pretrained_model, dropout, output_emb_size) + + self.margin = margin + # Used scaling cosine similarity to ease converge + self.sacle = scale + + def forward(self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, + query_attention_mask) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, + title_attention_mask) + + cosine_sim = paddle.matmul( + query_cls_embedding, title_cls_embedding, transpose_y=True) + + # substract margin from all positive samples cosine_sim() + margin_diag = paddle.full( + shape=[query_cls_embedding.shape[0]], + fill_value=self.margin, + dtype=paddle.get_default_dtype()) + + cosine_sim = cosine_sim - paddle.diag(margin_diag) + + # scale cosine to ease training converge + cosine_sim *= self.sacle + + labels = paddle.arange(0, query_cls_embedding.shape[0], dtype='int64') + labels = paddle.reshape(labels, shape=[-1, 1]) + + loss = F.cross_entropy(input=cosine_sim, label=labels) + + return loss diff --git a/application/neural_search/recall/in_batch_negative/data.py b/application/neural_search/recall/in_batch_negative/data.py new file mode 100644 index 000000000000..9eb9a1d0d499 --- /dev/null +++ b/application/neural_search/recall/in_batch_negative/data.py @@ -0,0 +1,184 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +import paddle + +from paddlenlp.utils.log import logger + + +def create_dataloader(dataset, + mode='train', + batch_size=1, + batchify_fn=None, + trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + shuffle = True if mode == 'train' else False + if mode == 'train': + batch_sampler = paddle.io.DistributedBatchSampler( + dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler( + dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader( + dataset=dataset, + batch_sampler=batch_sampler, + collate_fn=batchify_fn, + return_list=True) + + +def convert_example(example, + tokenizer, + max_seq_length=512, + pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. 
+ tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer( + text=text, + max_seq_len=max_seq_length, + pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def read_text_pair(data_path): + """Reads data.""" + with open(data_path, 'r', encoding='utf-8') as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 2: + continue + yield {'text_a': data[0], 'text_b': data[1]} + + +def read_text_triplet(data_path): + """Reads data.""" + with open(data_path, 'r', encoding='utf-8') as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 3: + continue + yield { + 'text': data[0], + 'pos_sample': data[1], + 'neg_sample': data[2] + } + + +# ANN - active learning ------------------------------------------------------ +def get_latest_checkpoint(args): + """ + Return: (latest_checkpint_path, global_step) + """ + if not os.path.exists(args.save_dir): + return args.init_from_ckpt, 0 + + subdirectories = list(next(os.walk(args.save_dir))[1]) + + def valid_checkpoint(checkpoint): + chk_path = os.path.join(args.save_dir, checkpoint) + scheduler_path = os.path.join(chk_path, "model_state.pdparams") + succeed_flag_file = os.path.join(chk_path, "succeed_flag_file") + return os.path.exists(scheduler_path) and os.path.exists( + succeed_flag_file) + + trained_steps = [int(s) for s in subdirectories if valid_checkpoint(s)] + + if len(trained_steps) > 0: + return os.path.join(args.save_dir, + str(max(trained_steps)), + "model_state.pdparams"), max(trained_steps) + + return args.init_from_ckpt, 0 + + +# ANN - active learning ------------------------------------------------------ +def get_latest_ann_data(ann_data_dir): + if not os.path.exists(ann_data_dir): + return None, -1 + + subdirectories = list(next(os.walk(ann_data_dir))[1]) + + def valid_checkpoint(step): + ann_data_file = os.path.join(ann_data_dir, step, "new_ann_data") + # succed_flag_file is an empty file that indicates ann data has been generated + succeed_flag_file = os.path.join(ann_data_dir, step, + "succeed_flag_file") + return os.path.exists(succeed_flag_file) and os.path.exists( + ann_data_file) + + ann_data_steps = [int(s) for s in subdirectories if valid_checkpoint(s)] + + if len(ann_data_steps) > 0: + latest_ann_data_file = os.path.join(ann_data_dir, + str(max(ann_data_steps)), + "new_ann_data") + logger.info("Using lateset ann_data_file:{}".format( + latest_ann_data_file)) + return latest_ann_data_file, max(ann_data_steps) + + logger.info("no new ann_data, return (None, -1)") + return None, -1 + + +def gen_id2corpus(corpus_file): + id2corpus = {} + with open(corpus_file, 'r', encoding='utf-8') as f: + for idx, line in enumerate(f): + id2corpus[idx] = line.rstrip() + return id2corpus + + +def 
gen_text_file(similar_text_pair_file): + text2similar_text = {} + texts = [] + with open(similar_text_pair_file, 'r', encoding='utf-8') as f: + for line in f: + splited_line = line.rstrip().split("\t") + if len(splited_line) != 2: + continue + + text, similar_text = line.rstrip().split("\t") + + if not text or not similar_text: + continue + + text2similar_text[text] = similar_text + texts.append({"text": text}) + return texts, text2similar_text diff --git a/application/neural_search/recall/in_batch_negative/deploy/python/deploy.sh b/application/neural_search/recall/in_batch_negative/deploy/python/deploy.sh new file mode 100644 index 000000000000..fe8f071e0a47 --- /dev/null +++ b/application/neural_search/recall/in_batch_negative/deploy/python/deploy.sh @@ -0,0 +1 @@ +python predict.py --model_dir=../../output \ No newline at end of file diff --git a/application/neural_search/recall/in_batch_negative/deploy/python/predict.py b/application/neural_search/recall/in_batch_negative/deploy/python/predict.py new file mode 100644 index 000000000000..c5c139b15dd2 --- /dev/null +++ b/application/neural_search/recall/in_batch_negative/deploy/python/predict.py @@ -0,0 +1,230 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import sys + +import numpy as np +import paddle +import paddlenlp as ppnlp +from scipy.special import softmax +from paddle import inference +from paddlenlp.data import Stack, Tuple, Pad +from paddlenlp.datasets import load_dataset +from paddlenlp.utils.log import logger + +sys.path.append('.') + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, + help="The directory to static model.") + +parser.add_argument("--max_seq_length", default=128, type=int, + help="The maximum total input sequence length after tokenization. 
Sequences " + "longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=15, type=int, + help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", + help="Select which device to train model, defaults to gpu.") + +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], + help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], + help='The tensorrt precision.') + +parser.add_argument('--cpu_threads', default=10, type=int, + help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], + help='Enable to use mkldnn to speed up when using cpu.') + +parser.add_argument("--benchmark", type=eval, default=False, + help="To log some information about environment and running.") +parser.add_argument("--save_log_path", type=str, default="./log_output/", + help="The file path to save log.") +args = parser.parse_args() +# yapf: enable + + +def convert_example(example, + tokenizer, + max_seq_length=512, + pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. 
+ """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer( + text=text, + max_seq_len=max_seq_length, + pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + +class Predictor(object): + def __init__(self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.pdmodel" + params_file = model_dir + "/inference.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as intialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8 + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, + min_subgraph_size=30, + precision_mode=precision_mode) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [ + self.predictor.get_input_handle(name) + for name in self.predictor.get_input_names() + ] + self.output_handle = self.predictor.get_output_handle( + self.predictor.get_output_names()[0]) + + if args.benchmark: + import auto_log + pid = os.getpid() + self.autolog = auto_log.AutoLogger( + model_name="ernie-1.0", + model_precision=precision, + batch_size=self.batch_size, + data_shape="dynamic", + save_path=args.save_log_path, + inference_config=config, + pids=pid, + process_name=None, + gpu_ids=0, + time_keys=[ + 'preprocess_time', 'inference_time', 'postprocess_time' + ], + warmup=0, + logger=logger) + + def predict(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + + Returns: + results(obj:`dict`): All the predictions labels. 
+ """ + if args.benchmark: + self.autolog.times.start() + + examples = [] + for text in data: + input_ids, segment_ids = convert_example( + text, tokenizer) + examples.append((input_ids, segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id), # segment + ): fn(samples) + + if args.benchmark: + self.autolog.times.stamp() + + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + if args.benchmark: + self.autolog.times.stamp() + + if args.benchmark: + self.autolog.times.end(stamp=True) + + return logits + + +if __name__ == "__main__": + # Define predictor to do prediction. + predictor = Predictor(args.model_dir, args.device, args.max_seq_length, + args.batch_size, args.use_tensorrt, args.precision, + args.cpu_threads, args.enable_mkldnn) + + # ErnieTinyTokenizer is special for ernie-tiny pretained model. + output_emb_size=256 + tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained('ernie-1.0') + id2corpus={0:'国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据'} + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + res=predictor.predict(corpus_list, tokenizer) + print(res.shape) + print(res) \ No newline at end of file diff --git a/application/neural_search/recall/in_batch_negative/evaluate.py b/application/neural_search/recall/in_batch_negative/evaluate.py new file mode 100644 index 000000000000..f59aae3897d7 --- /dev/null +++ b/application/neural_search/recall/in_batch_negative/evaluate.py @@ -0,0 +1,94 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
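+
+# evaluate.py computes Recall@N for the recall stage. The recall result file
+# written by recall.py holds `recall_num` lines per query (query text, recalled
+# text, similarity); a recalled text is flagged 1 if it equals the labeled similar
+# text from `similar_text_pair`, otherwise 0. Recall@N is the mean, over all
+# queries, of whether the labeled similar text appears within the top N recalled
+# results. Metrics for N in {1, 5, 10, 20, 50} are printed and also appended,
+# with a timestamp, to result.tsv.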
+ +import os +import argparse + +import numpy as np + +from paddlenlp.utils.log import logger +import time + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--similar_text_pair", type=str, default='', help="The full path of similat pair file") +parser.add_argument("--recall_result_file", type=str, default='', help="The full path of recall result file") +parser.add_argument("--recall_num", type=int, default=10, help="Most similair number of doc recalled from corpus per query") + + +args = parser.parse_args() +# yapf: enable + + +def recall(rs, N=10): + """ + Ratio of recalled Ground Truth at topN Recalled Docs + >>> rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]] + >>> recall(rs, N=1) + 0.333333 + >>> recall(rs, N=2) + >>> 0.6666667 + >>> recall(rs, N=3) + >>> 1.0 + Args: + rs: Iterator of recalled flag() + Returns: + Recall@N + """ + + recall_flags = [np.sum(r[0:N]) for r in rs] + return np.mean(recall_flags) + + +if __name__ == "__main__": + text2similar = {} + with open(args.similar_text_pair, 'r', encoding='utf-8') as f: + for line in f: + text, similar_text = line.rstrip().split("\t") + text2similar[text] = similar_text + + rs = [] + + with open(args.recall_result_file, 'r', encoding='utf-8') as f: + relevance_labels = [] + for index, line in enumerate(f): + + if index % args.recall_num == 0 and index != 0: + rs.append(relevance_labels) + relevance_labels = [] + + text, recalled_text, cosine_sim = line.rstrip().split("\t") + if text == recalled_text: + continue + if text2similar[text] == recalled_text: + relevance_labels.append(1) + else: + relevance_labels.append(0) + # print(len(rs)) + # print(rs[:50]) + + recall_N = [] + recall_num=[1,5,10,20,50] + for topN in recall_num: + R = round(100 * recall(rs, N=topN), 3) + recall_N.append(str(R)) + result=open('result.tsv','a') + res=[] + timestamp=time.strftime('%Y%m%d-%H%M%S',time.localtime()) + res.append(timestamp) + for key,val in zip(recall_num,recall_N): + print('recall@{}={}'.format(key,val)) + res.append(str(val)) + result.write('\t'.join(res)+'\n') + # print("\t".join(recall_N)) diff --git a/application/neural_search/recall/in_batch_negative/export_model.py b/application/neural_search/recall/in_batch_negative/export_model.py new file mode 100644 index 000000000000..3da4205fc003 --- /dev/null +++ b/application/neural_search/recall/in_batch_negative/export_model.py @@ -0,0 +1,62 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
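+
+# export_model.py converts a trained dygraph checkpoint into a static inference
+# model: it loads the parameters from --params_path into SemanticIndexBaseStatic
+# (ERNIE 1.0 backbone, 256-dim output embedding), traces the model with
+# paddle.jit.to_static using two [None, None] int64 InputSpecs (input_ids and
+# segment_ids), and saves it with paddle.jit.save under --output_path/inference,
+# producing the inference.pdmodel / inference.pdiparams files consumed by
+# deploy/python/predict.py.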
+ +import argparse +import os +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F +import paddlenlp as ppnlp +from paddlenlp.data import Stack, Tuple, Pad + +from base_model import SemanticIndexBase,SemanticIndexBaseStatic + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, default='./checkpoint/model_900/model_state.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./output', help="The path of model parameter in static graph to be saved.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + # If you want to use ernie1.0 model, plesace uncomment the following code + output_emb_size=256 + + pretrained_model = ppnlp.transformers.ErnieModel.from_pretrained("ernie-1.0") + + tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained('ernie-1.0') + model = SemanticIndexBaseStatic( + pretrained_model, output_emb_size=output_emb_size) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + + model.eval() + + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec( + shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec( + shape=[None, None], dtype="int64") # segment_ids + ]) + # Save in static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) \ No newline at end of file diff --git a/application/neural_search/recall/in_batch_negative/inference.py b/application/neural_search/recall/in_batch_negative/inference.py new file mode 100644 index 000000000000..41fea8b3f163 --- /dev/null +++ b/application/neural_search/recall/in_batch_negative/inference.py @@ -0,0 +1,82 @@ +from functools import partial +import argparse +import os +import sys +import random +import time + +import numpy as np +import paddle +import paddle.nn.functional as F +import paddlenlp as ppnlp +from paddlenlp.data import Stack, Tuple, Pad +from paddlenlp.datasets import load_dataset, MapDataset, load_dataset +from paddlenlp.utils.log import logger + +from base_model import SemanticIndexBase,SemanticIndexBaseStatic +from data import convert_example, create_dataloader +from data import gen_id2corpus, gen_text_file +from ann_util import build_index +from tqdm import tqdm + + +if __name__ == "__main__": + device= 'gpu' + max_seq_length=64 + output_emb_size=256 + batch_size=1 + params_path='checkpoints/inbatch/model_40/model_state.pdparams' + id2corpus={0:'国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据'} + paddle.set_device(device) + + tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained('ernie-1.0') + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # text_segment + ): [data for data in fn(samples)] + + pretrained_model = ppnlp.transformers.ErnieModel.from_pretrained("ernie-1.0") + + model = SemanticIndexBaseStatic( + pretrained_model, output_emb_size=output_emb_size) + + # Load pretrained semantic model + if params_path and os.path.isfile(params_path): + state_dict = paddle.load(params_path) + model.set_dict(state_dict) + print("Loaded 
parameters from %s" % params_path) + else: + raise ValueError( + "Please set --params_path with correct pretrained model file") + + + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + + corpus_data_loader = create_dataloader( + corpus_ds, + mode='predict', + batch_size=batch_size, + batchify_fn=batchify_fn, + trans_fn=trans_func) + + all_embeddings = [] + model.eval() + with paddle.no_grad(): + for batch_data in corpus_data_loader: + input_ids, token_type_ids = batch_data + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + + text_embeddings = model(input_ids, token_type_ids) + all_embeddings.append(text_embeddings) + + text_embedding=all_embeddings[0] + print(text_embedding.shape) + print(text_embedding.numpy()) diff --git a/application/neural_search/recall/in_batch_negative/predict.py b/application/neural_search/recall/in_batch_negative/predict.py new file mode 100644 index 000000000000..2a6f289fb15d --- /dev/null +++ b/application/neural_search/recall/in_batch_negative/predict.py @@ -0,0 +1,127 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from functools import partial +import argparse +import sys +import os +import random +import time + +import numpy as np +import paddle +import paddle.nn.functional as F +import paddlenlp as ppnlp +from paddlenlp.datasets import load_dataset +from paddlenlp.data import Stack, Tuple, Pad + +from data import read_text_pair, convert_example, create_dataloader +from base_model import SemanticIndexBase + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--text_pair_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--pad_to_max_seq_len", action="store_true", help="Whether to pad to max seq length.") +args = parser.parse_args() +# yapf: enable + + +def predict(model, data_loader): + """ + Predicts the data labels. + + Args: + model (obj:`SemanticIndexBase`): A model to extract text embedding or calculate similarity of text pair. 
+ data_loaer (obj:`List(Example)`): The processed data ids of text pair: [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. + """ + cosine_sims = [] + + model.eval() + + with paddle.no_grad(): + for batch_data in data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch_data + + query_input_ids = paddle.to_tensor(query_input_ids) + query_token_type_ids = paddle.to_tensor(query_token_type_ids) + title_input_ids = paddle.to_tensor(title_input_ids) + title_token_type_ids = paddle.to_tensor(title_token_type_ids) + + batch_cosine_sim = model.cosine_sim( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids).numpy() + + cosine_sims.append(batch_cosine_sim) + + cosine_sims = np.concatenate(cosine_sims, axis=0) + + return cosine_sims + + +if __name__ == "__main__": + paddle.set_device(args.device) + + tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained('ernie-1.0') + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length, + pad_to_max_seq_len=args.pad_to_max_seq_len) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # tilte_segment + ): [data for data in fn(samples)] + + valid_ds = load_dataset( + read_text_pair, data_path=args.text_pair_file, lazy=False) + + valid_data_loader = create_dataloader( + valid_ds, + mode='predict', + batch_size=args.batch_size, + batchify_fn=batchify_fn, + trans_fn=trans_func) + + pretrained_model = ppnlp.transformers.ErnieModel.from_pretrained( + "ernie-1.0") + + model = SemanticIndexBase( + pretrained_model, output_emb_size=args.output_emb_size) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError( + "Please set --params_path with correct pretrained model file") + + cosin_sim = predict(model, valid_data_loader) + + for idx, cosine in enumerate(cosin_sim): + print('{}'.format(cosine)) diff --git a/application/neural_search/recall/in_batch_negative/recall.py b/application/neural_search/recall/in_batch_negative/recall.py new file mode 100644 index 000000000000..351316b67c4a --- /dev/null +++ b/application/neural_search/recall/in_batch_negative/recall.py @@ -0,0 +1,141 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
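+
+# recall.py performs ANN recall for evaluation: it encodes every line of
+# --corpus_file with the trained semantic index model, builds an hnswlib
+# inner-product index via build_index() in ann_util.py, encodes the query side of
+# --similar_text_pair_file, and calls knn_query to retrieve the top --recall_num
+# candidates per query. Each (query, recalled text, 1 - distance) triple is written
+# to the recall result file, which evaluate.py then uses to compute Recall@N.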
+ +# coding=UTF-8 + +from functools import partial +import argparse +import os +import sys +import random +import time + +import numpy as np +import hnswlib +import paddle +import paddle.nn.functional as F +import paddlenlp as ppnlp +from paddlenlp.data import Stack, Tuple, Pad +from paddlenlp.datasets import load_dataset, MapDataset, load_dataset +from paddlenlp.utils.log import logger + +from base_model import SemanticIndexBase +from data import convert_example, create_dataloader +from data import gen_id2corpus, gen_text_file +from ann_util import build_index + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--corpus_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--similar_text_pair_file", type=str, required=True, help="The full path of similar text pair file") +parser.add_argument("--recall_result_dir", type=str, default='recall_result', help="The full path of recall result file to save") +parser.add_argument("--recall_result_file", type=str, default='recall_result_file', help="The file name of recall result") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument("--recall_num", default=10, type=int, help="Recall number for each query from Ann index.") + +parser.add_argument("--hnsw_m", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_ef", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_max_elements", default=1000000, type=int, help="Recall number for each query from Ann index.") + +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained('ernie-1.0') + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # text_segment + ): [data for data in fn(samples)] + + pretrained_model = ppnlp.transformers.ErnieModel.from_pretrained( + "ernie-1.0") + + model = SemanticIndexBase( + pretrained_model, output_emb_size=args.output_emb_size) + model = paddle.DataParallel(model) + + # Load pretrained semantic model + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + logger.info("Loaded parameters from %s" % args.params_path) + else: + raise ValueError( + "Please set --params_path with correct pretrained model file") + + id2corpus = gen_id2corpus(args.corpus_file) + + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + 
corpus_ds = MapDataset(corpus_list) + + corpus_data_loader = create_dataloader( + corpus_ds, + mode='predict', + batch_size=args.batch_size, + batchify_fn=batchify_fn, + trans_fn=trans_func) + + # Need better way to get inner model of DataParallel + inner_model = model._layers + + final_index = build_index(args, corpus_data_loader, inner_model) + + text_list, text2similar_text = gen_text_file(args.similar_text_pair_file) + + query_ds = MapDataset(text_list) + + query_data_loader = create_dataloader( + query_ds, + mode='predict', + batch_size=args.batch_size, + batchify_fn=batchify_fn, + trans_fn=trans_func) + + query_embedding = inner_model.get_semantic_embedding(query_data_loader) + + if not os.path.exists(args.recall_result_dir): + os.mkdir(args.recall_result_dir) + + recall_result_file = os.path.join(args.recall_result_dir, + args.recall_result_file) + with open(recall_result_file, 'w', encoding='utf-8') as f: + for batch_index, batch_query_embedding in enumerate(query_embedding): + recalled_idx, cosine_sims = final_index.knn_query( + batch_query_embedding.numpy(), args.recall_num) + + batch_size = len(cosine_sims) + + for row_index in range(batch_size): + text_index = args.batch_size * batch_index + row_index + for idx, doc_idx in enumerate(recalled_idx[row_index]): + f.write("{}\t{}\t{}\n".format(text_list[text_index][ + "text"], id2corpus[doc_idx], 1.0 - cosine_sims[ + row_index][idx])) diff --git a/application/neural_search/recall/in_batch_negative/scripts/evaluate.sh b/application/neural_search/recall/in_batch_negative/scripts/evaluate.sh new file mode 100644 index 000000000000..84d6f162b80e --- /dev/null +++ b/application/neural_search/recall/in_batch_negative/scripts/evaluate.sh @@ -0,0 +1,4 @@ +python -u evaluate.py \ + --similar_text_pair "recall/dev.csv" \ + --recall_result_file "./recall_result_dir/recall_result.txt" \ + --recall_num 50 \ No newline at end of file diff --git a/application/neural_search/recall/in_batch_negative/scripts/export_model.sh b/application/neural_search/recall/in_batch_negative/scripts/export_model.sh new file mode 100644 index 000000000000..f59ecefbfbab --- /dev/null +++ b/application/neural_search/recall/in_batch_negative/scripts/export_model.sh @@ -0,0 +1 @@ +python export_model.py --params_path checkpoints/inbatch/model_40/model_state.pdparams --output_path=./output \ No newline at end of file diff --git a/application/neural_search/recall/in_batch_negative/scripts/predict.sh b/application/neural_search/recall/in_batch_negative/scripts/predict.sh new file mode 100644 index 000000000000..5a253520ded0 --- /dev/null +++ b/application/neural_search/recall/in_batch_negative/scripts/predict.sh @@ -0,0 +1,22 @@ +# gpu version + +root_dir="checkpoints/inbatch" +python -u -m paddle.distributed.launch --gpus "3" \ + predict.py \ + --device gpu \ + --params_path "${root_dir}/model_40/model_state.pdparams" \ + --output_emb_size 256 \ + --batch_size 128 \ + --max_seq_length 64 \ + --text_pair_file "recall/test.csv" + + +# cpu +# root_dir="checkpoints/inbatch" +# python predict.py \ +# --device cpu \ +# --params_path "${root_dir}/model_40/model_state.pdparams" \ +# --output_emb_size 256 \ +# --batch_size 128 \ +# --max_seq_length 64 \ +# --text_pair_file "recall/test.csv" diff --git a/application/neural_search/recall/in_batch_negative/scripts/run_build_index.sh b/application/neural_search/recall/in_batch_negative/scripts/run_build_index.sh new file mode 100755 index 000000000000..a9f400dfb401 --- /dev/null +++ 
b/application/neural_search/recall/in_batch_negative/scripts/run_build_index.sh @@ -0,0 +1,31 @@ +# GPU version +root_dir="checkpoints/inbatch" +python -u -m paddle.distributed.launch --gpus "3" --log_dir "recall_log/" \ + recall.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --params_path "${root_dir}/model_40/model_state.pdparams" \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 64 \ + --output_emb_size 256\ + --max_seq_length 60 \ + --recall_num 50 \ + --similar_text_pair "recall/dev.csv" \ + --corpus_file "recall/corpus.csv" + +# CPU version +# python recall.py \ +# --device cpu \ +# --recall_result_dir "recall_result_dir" \ +# --recall_result_file "recall_result.txt" \ +# --params_path "${root_dir}/model_40/model_state.pdparams" \ +# --hnsw_m 100 \ +# --hnsw_ef 100 \ +# --batch_size 64 \ +# --output_emb_size 256\ +# --max_seq_length 60 \ +# --recall_num 50 \ +# --similar_text_pair "recall/dev.csv" \ +# --corpus_file "recall/corpus.csv" \ No newline at end of file diff --git a/application/neural_search/recall/in_batch_negative/scripts/train_batch_neg.sh b/application/neural_search/recall/in_batch_negative/scripts/train_batch_neg.sh new file mode 100644 index 000000000000..fc40e1f1872b --- /dev/null +++ b/application/neural_search/recall/in_batch_negative/scripts/train_batch_neg.sh @@ -0,0 +1,61 @@ +# GPU training +root_path=inbatch +python -u -m paddle.distributed.launch --gpus "0,1,2,3" \ + train_batch_neg.py \ + --device gpu \ + --save_dir ./checkpoints/${root_path} \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 3 \ + --output_emb_size 256 \ + --save_steps 10 \ + --max_seq_length 64 \ + --margin 0.2 \ + --train_set_file recall/train.csv + + +# cpu training +# root_path=inbatch +# python train_batch_neg.py \ +# --device cpu \ +# --save_dir ./checkpoints/${root_path} \ +# --batch_size 64 \ +# --learning_rate 5E-5 \ +# --epochs 3 \ +# --output_emb_size 256 \ +# --save_steps 10 \ +# --max_seq_length 64 \ +# --margin 0.2 \ +# --train_set_file recall/train.csv + + + +# 加载simcse训练的模型,模型放在simcse/model_20000 +# python -u -m paddle.distributed.launch --gpus "0,1,2,3" \ +# train_batch_neg.py \ +# --device gpu \ +# --save_dir ./checkpoints/simcse_inbatch_negative \ +# --batch_size 64 \ +# --learning_rate 5E-5 \ +# --epochs 3 \ +# --output_emb_size 256 \ +# --save_steps 10 \ +# --max_seq_length 64 \ +# --margin 0.2 \ +# --train_set_file data/${root_path}/train.csv \ +# --init_from_ckpt simcse/model_20000/model_state.pdparams + +# 加载post training的模型,模型放在simcse/post_model_10000 +# python -u -m paddle.distributed.launch --gpus "0,1,2,3" \ +# train_batch_neg.py \ +# --device gpu \ +# --save_dir ./checkpoints/post_simcse_inbatch_negative \ +# --batch_size 64 \ +# --learning_rate 5E-5 \ +# --epochs 3 \ +# --output_emb_size 256 \ +# --save_steps 10 \ +# --max_seq_length 64 \ +# --margin 0.2 \ +# --train_set_file data/${root_path}/train.csv \ +# --init_from_ckpt simcse/post_model_10000/model_state.pdparams diff --git a/application/neural_search/recall/in_batch_negative/train_batch_neg.py b/application/neural_search/recall/in_batch_negative/train_batch_neg.py new file mode 100644 index 000000000000..53435f7bab25 --- /dev/null +++ b/application/neural_search/recall/in_batch_negative/train_batch_neg.py @@ -0,0 +1,165 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from functools import partial +import argparse +import os +import sys +import random +import time + +import numpy as np +import paddle +import paddle.nn.functional as F + +import paddlenlp as ppnlp +from paddlenlp.data import Stack, Tuple, Pad +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import LinearDecayWithWarmup + +from batch_negative.model import SemanticIndexBatchNeg +from data import read_text_pair, convert_example, create_dataloader + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proption over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--save_steps', type=int, default=10000, help="Inteval steps to save checkpoint") +parser.add_argument("--train_set_file", type=str, required=True, help="The full path of train_set_file") +parser.add_argument("--margin", default=0.3, type=float, help="Margin beteween pos_sample and neg_samples") +parser.add_argument("--scale", default=30, type=int, help="Scale for pair-wise margin_rank_loss") + + +args = parser.parse_args() +# yapf: enable + + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + + train_ds = load_dataset( + read_text_pair, data_path=args.train_set_file, lazy=False) + + + pretrained_model = ppnlp.transformers.ErnieModel.from_pretrained( + 'ernie-1.0') + + tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained('ernie-1.0') + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, 
pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # tilte_segment + ): [data for data in fn(samples)] + + train_data_loader = create_dataloader( + train_ds, + mode='train', + batch_size=args.batch_size, + batchify_fn=batchify_fn, + trans_fn=trans_func) + + model = SemanticIndexBatchNeg( + pretrained_model, + margin=args.margin, + scale=args.scale, + output_emb_size=args.output_emb_size) + + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + print("warmup from:{}".format(args.init_from_ckpt)) + + model = paddle.DataParallel(model) + + num_training_steps = len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, + args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. + decay_params = [ + p.name for n, p in model.named_parameters() + if not any(nd in n for nd in ["bias", "norm"]) + ] + optimizer = paddle.optimizer.AdamW( + learning_rate=lr_scheduler, + parameters=model.parameters(), + weight_decay=args.weight_decay, + apply_decay_param_fun=lambda x: x in decay_params) + + global_step = 0 + tic_train = time.time() + for epoch in range(1, args.epochs + 1): + for step, batch in enumerate(train_data_loader, start=1): + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch + + loss = model( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids) + + global_step += 1 + if global_step % 50 == 0 and rank == 0: + print( + "global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s" + % (global_step, epoch, step, loss, + 10 / (time.time() - tic_train))) + tic_train = time.time() + loss.backward() + optimizer.step() + lr_scheduler.step() + optimizer.clear_grad() + if global_step % args.save_steps == 0 and rank == 0: + save_dir = os.path.join(args.save_dir, "model_%d" % global_step) + if not os.path.exists(save_dir): + os.makedirs(save_dir) + save_param_path = os.path.join(save_dir, 'model_state.pdparams') + paddle.save(model.state_dict(), save_param_path) + tokenizer.save_pretrained(save_dir) + + +if __name__ == "__main__": + do_train() diff --git a/application/neural_search/recall/milvus/README.md b/application/neural_search/recall/milvus/README.md new file mode 100644 index 000000000000..c24b801c7cb0 --- /dev/null +++ b/application/neural_search/recall/milvus/README.md @@ -0,0 +1,214 @@ + **目录** + +* [背景介绍](#背景介绍) +* [Milvus召回](#Milvus召回) + * [1. 技术方案和评估指标](#技术方案) + * [2. 环境依赖](#环境依赖) + * [3. 代码结构](#代码结构) + * [4. 数据准备](#数据准备) + * [5. 向量检索](#向量检索) + + + + +# 背景介绍 + +基于某检索平台开源的数据集构造生成了面向语义索引的召回库。 + + + +# Milvus召回 + + + +## 1. 技术方案和评估指标 + +### 技术方案 + +使用 Milvus 搭建召回系统,然后使用训练好的语义索引模型,抽取向量,插入到 Milvus 中,然后进行检索。 + + + +## 2. 环境依赖和安装说明 + +**环境依赖** +* python >= 3.6 +* paddlepaddle >= 2.2 +* paddlenlp >= 2.2 +* milvus >= 1.1.1 +* pymilvus >= 1.1.2 + + + +## 3. 
代码结构 + +## 代码结构: + +``` +|—— scripts + |—— feature_extract.sh 提取特征向量的bash脚本 +├── base_model.py # 语义索引模型基类 +├── config.py # milvus配置文件 +├── data.py # 数据处理函数 +├── embedding_insert.py # 插入向量 +├── embedding_recall.py # 检索topK相似结果 / ANN +├── inference.py # 动态图模型向量抽取脚本 +├── feature_extract.py # 批量抽取向量脚本 +├── milvus_insert.py # 插入向量工具类 +├── milvus_recall.py # 向量召回工具类 +├── README.md +└── server_config.yml # milvus的config文件,本项目所用的配置 +``` + + +## 4. 数据准备 + +数据集的样例如下,有两种,第一种是 title+keywords 进行拼接;第二种是一句话。 + +``` +煤矸石-污泥基活性炭介导强化污水厌氧消化煤矸石,污泥,复合基活性炭,厌氧消化,直接种间电子传递 +睡眠障碍与常见神经系统疾病的关系睡眠觉醒障碍,神经系统疾病,睡眠,快速眼运动,细胞增殖,阿尔茨海默病 +城市道路交通流中观仿真研究智能运输系统;城市交通管理;计算机仿真;城市道路;交通流;路径选择 +.... +``` + +### 数据集下载 + + +- [literature_search_data](https://bj.bcebos.com/v1/paddlenlp/data/literature_search_data.zip) + +``` +├── milvus # milvus建库数据集 + ├── milvus_data.csv. # 构建召回库的数据 +├── recall # 召回(语义索引)数据集 + ├── corpus.csv # 用于测试的召回库 + ├── dev.csv # 召回验证集 + ├── test.csv # 召回测试集 + ├── train.csv # 召回训练集 + ├── train_unsupervised.csv # 无监督训练集 +├── sort # 排序数据集 + ├── test_pairwise.csv # 排序测试集 + ├── dev_pairwise.csv # 排序验证集 + └── train_pairwise.csv # 排序训练集 + +``` + + + +## 5. 向量检索 + + +数据准备结束以后,我们开始搭建Milvus的语义检索引擎,用于语义向量的快速检索,我们使用[Milvus](https://milvus.io/)开源工具进行召回,milvus的搭建教程请参考官方教程 [milvus官方安装教程](https://milvus.io/cn/docs/v1.1.1/milvus_docker-cpu.md)本案例使用的是milvus的1.1.1版本,搭建完以后启动milvus + + +``` +cd [Milvus root path]/core/milvus +export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:[Milvus root path]/core/milvus/lib +cd scripts +./start_server.sh + +``` + +搭建完系统以后就可以插入和检索向量了,首先生成embedding向量,每个样本生成256维度的向量,使用的是32GB的V100的卡进行的提取: + +``` +root_dir="checkpoints" +python -u -m paddle.distributed.launch --gpus "3" --log_dir "recall_log/" \ + feature_extract.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --params_path "${root_dir}/model_40/model_state.pdparams" \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 4096 \ + --output_emb_size 256\ + --max_seq_length 60 \ + --recall_num 50 \ + --similar_text_pair "recall/test.csv" \ + --corpus_file "milvus/milvus_data.csv" +``` + +| 数据量 | 时间 | +| ------------ | ------------ | +|1000万条|5hour50min03s| + +运行结束后会生成 corpus_embedding.npy + +生成了向量后,需要把数据抽炒入到Milvus库中,首先修改配置: + +修改config.py的配置ip: + +``` +MILVUS_HOST='your milvus ip' +``` + +然后运行下面的命令把向量插入到Milvus库中: + +``` +python3 embedding_insert.py +``` + + +| 数据量 | 时间 | +| ------------ | ------------ | +|1000万条|12min24s| + +另外,milvus提供了可视化的管理界面,可以很方便的查看数据,安装地址为[Milvus Enterprise Manager](https://zilliz.com/products/em). + +![](../../img/mem.png) + + +运行召回脚本: + +``` +python3 embedding_recall.py + +``` +运行的结果为,表示的是召回的id和与当前的query计算的距离: + +``` +10000000 +time cost 0.5410025119781494 s +Status(code=0, message='Search vectors successfully!') +[ +[ +(id:1, distance:0.0), +(id:7109733, distance:0.832247257232666), +(id:6770053, distance:0.8488889932632446), +(id:2653227, distance:0.9032443761825562), +... +``` + +第一次检索的时间大概是18s左右,需要把数据从磁盘加载到内存,后面检索就很快,下面是测试的速度: + +| 数据量 | 时间 | +| ------------ | ------------ | +|100条|0.15351247787475586| + + +修改代码的模型路径和样本: + +``` +params_path='checkpoints/model_40/model_state.pdparams' +id2corpus={0:'国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据'} +``` + +运行命令 + +``` +python3 inference.py + +``` +运行的输出为,分别是抽取的向量和召回的结果: + +``` +[1, 256] +[[ 0.06374735 -0.08051944 0.05118101 -0.05855767 -0.06969483 0.05318566 + 0.079629 0.02667932 -0.04501902 -0.01187392 0.09590752 -0.05831281 + .... 
+5677638 国有股权参股对家族企业创新投入的影响混合所有制改革,国有股权,家族企业,创新投入 0.5417419672012329 +1321645 高管政治联系对民营企业创新绩效的影响——董事会治理行为的非线性中介效应高管政治联系,创新绩效,民营上市公司,董事会治理行为,中介效应 0.5445536375045776 +1340319 国有控股上市公司资产并购重组风险探讨国有控股上市公司,并购重组,防范对策 0.5515031218528748 +.... +``` \ No newline at end of file diff --git a/application/neural_search/recall/milvus/base_model.py b/application/neural_search/recall/milvus/base_model.py new file mode 100644 index 000000000000..1fe2d6c6334e --- /dev/null +++ b/application/neural_search/recall/milvus/base_model.py @@ -0,0 +1,180 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import abc +import sys + +import numpy as np + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class SemanticIndexBase(nn.Layer): + def __init__(self, pretrained_model, dropout=None, output_emb_size=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is not None, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off beteween + # recall performance and efficiency + + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr( + initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear( + 768, output_emb_size, weight_attr=weight_attr) + + @paddle.jit.to_static(input_spec=[paddle.static.InputSpec(shape=[None, None], dtype='int64'),paddle.static.InputSpec(shape=[None, None], dtype='int64')]) + def get_pooled_embedding(self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, + attention_mask) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + + text_embeddings = self.get_pooled_embedding( + input_ids, token_type_ids=token_type_ids) + + yield text_embeddings + + def cosine_sim(self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, + query_attention_mask) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, + title_attention_mask) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, + axis=-1) + return cosine_sim + + 
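+    # Note: forward is left abstract in this base class. In the milvus pipeline the model is
+    # only used for embedding extraction (feature_extract.py calls get_semantic_embedding),
+    # so no training forward pass is needed; training models define their own forward/loss.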
@abc.abstractmethod + def forward(self): + pass + + + +class SemanticIndexBaseStatic(nn.Layer): + def __init__(self, pretrained_model, dropout=None, output_emb_size=None): + super().__init__() + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is not None, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off beteween + # recall performance and efficiency + + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr( + initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear( + 768, output_emb_size, weight_attr=weight_attr) + + @paddle.jit.to_static(input_spec=[paddle.static.InputSpec(shape=[None, None], dtype='int64'),paddle.static.InputSpec(shape=[None, None], dtype='int64')]) + def get_pooled_embedding(self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, + attention_mask) + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + + text_embeddings = self.get_pooled_embedding( + input_ids, token_type_ids=token_type_ids) + + yield text_embeddings + + def cosine_sim(self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, + query_attention_mask) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, + title_attention_mask) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, + axis=-1) + return cosine_sim + + + def forward(self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None): + _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, + attention_mask) + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding \ No newline at end of file diff --git a/application/neural_search/recall/milvus/config.py b/application/neural_search/recall/milvus/config.py new file mode 100644 index 000000000000..6529e5d4e669 --- /dev/null +++ b/application/neural_search/recall/milvus/config.py @@ -0,0 +1,32 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import os +from milvus import MetricType, IndexType + +MILVUS_HOST='10.21.226.173' +MILVUS_PORT = 8530 + +collection_param = { + 'dimension': 256, + 'index_file_size': 256, + 'metric_type': MetricType.L2 +} + +index_type = IndexType.IVF_FLAT +index_param = {'nlist': 1000} + +top_k = 100 +search_param = {'nprobe': 20} + diff --git a/application/neural_search/recall/milvus/data.py b/application/neural_search/recall/milvus/data.py new file mode 100644 index 000000000000..9eb9a1d0d499 --- /dev/null +++ b/application/neural_search/recall/milvus/data.py @@ -0,0 +1,184 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +import paddle + +from paddlenlp.utils.log import logger + + +def create_dataloader(dataset, + mode='train', + batch_size=1, + batchify_fn=None, + trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + shuffle = True if mode == 'train' else False + if mode == 'train': + batch_sampler = paddle.io.DistributedBatchSampler( + dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler( + dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader( + dataset=dataset, + batch_sampler=batch_sampler, + collate_fn=batchify_fn, + return_list=True) + + +def convert_example(example, + tokenizer, + max_seq_length=512, + pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. 
+ """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer( + text=text, + max_seq_len=max_seq_length, + pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def read_text_pair(data_path): + """Reads data.""" + with open(data_path, 'r', encoding='utf-8') as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 2: + continue + yield {'text_a': data[0], 'text_b': data[1]} + + +def read_text_triplet(data_path): + """Reads data.""" + with open(data_path, 'r', encoding='utf-8') as f: + for line in f: + data = line.rstrip().split("\t") + if len(data) != 3: + continue + yield { + 'text': data[0], + 'pos_sample': data[1], + 'neg_sample': data[2] + } + + +# ANN - active learning ------------------------------------------------------ +def get_latest_checkpoint(args): + """ + Return: (latest_checkpint_path, global_step) + """ + if not os.path.exists(args.save_dir): + return args.init_from_ckpt, 0 + + subdirectories = list(next(os.walk(args.save_dir))[1]) + + def valid_checkpoint(checkpoint): + chk_path = os.path.join(args.save_dir, checkpoint) + scheduler_path = os.path.join(chk_path, "model_state.pdparams") + succeed_flag_file = os.path.join(chk_path, "succeed_flag_file") + return os.path.exists(scheduler_path) and os.path.exists( + succeed_flag_file) + + trained_steps = [int(s) for s in subdirectories if valid_checkpoint(s)] + + if len(trained_steps) > 0: + return os.path.join(args.save_dir, + str(max(trained_steps)), + "model_state.pdparams"), max(trained_steps) + + return args.init_from_ckpt, 0 + + +# ANN - active learning ------------------------------------------------------ +def get_latest_ann_data(ann_data_dir): + if not os.path.exists(ann_data_dir): + return None, -1 + + subdirectories = list(next(os.walk(ann_data_dir))[1]) + + def valid_checkpoint(step): + ann_data_file = os.path.join(ann_data_dir, step, "new_ann_data") + # succed_flag_file is an empty file that indicates ann data has been generated + succeed_flag_file = os.path.join(ann_data_dir, step, + "succeed_flag_file") + return os.path.exists(succeed_flag_file) and os.path.exists( + ann_data_file) + + ann_data_steps = [int(s) for s in subdirectories if valid_checkpoint(s)] + + if len(ann_data_steps) > 0: + latest_ann_data_file = os.path.join(ann_data_dir, + str(max(ann_data_steps)), + "new_ann_data") + logger.info("Using lateset ann_data_file:{}".format( + latest_ann_data_file)) + return latest_ann_data_file, max(ann_data_steps) + + logger.info("no new ann_data, return (None, -1)") + return None, -1 + + +def gen_id2corpus(corpus_file): + id2corpus = {} + with open(corpus_file, 'r', encoding='utf-8') as f: + for idx, line in enumerate(f): + id2corpus[idx] = line.rstrip() + return id2corpus + + +def gen_text_file(similar_text_pair_file): + text2similar_text = {} + texts = [] + with open(similar_text_pair_file, 'r', encoding='utf-8') as f: + for line in f: + splited_line = line.rstrip().split("\t") + if len(splited_line) != 2: + continue + + text, similar_text = line.rstrip().split("\t") + + if not text or not similar_text: + continue + + text2similar_text[text] = similar_text + texts.append({"text": text}) + return texts, text2similar_text diff --git a/application/neural_search/recall/milvus/embedding_insert.py b/application/neural_search/recall/milvus/embedding_insert.py new file mode 100644 index 000000000000..4795017521e2 --- /dev/null +++ 
b/application/neural_search/recall/milvus/embedding_insert.py @@ -0,0 +1,37 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +from milvus_insert import VecToMilvus +import random +from tqdm import tqdm + +embeddings=np.load('corpus_embedding.npy') +print(embeddings.shape) + +embedding_ids = [i for i in range(embeddings.shape[0])] +print(len(embedding_ids)) +client = VecToMilvus() +collection_name = 'literature_search' +partition_tag = 'partition_2' +data_size=len(embedding_ids) +batch_size=100000 +for i in tqdm(range(0,data_size,batch_size)): + cur_end=i+batch_size + if(cur_end>data_size): + cur_end=data_size + batch_emb=embeddings[np.arange(i,cur_end)] + status, ids = client.insert(collection_name=collection_name, vectors=batch_emb.tolist(), ids=embedding_ids[i:i+batch_size],partition_tag=partition_tag) + # print(status) + # print(ids) diff --git a/application/neural_search/recall/milvus/embedding_recall.py b/application/neural_search/recall/milvus/embedding_recall.py new file mode 100644 index 000000000000..3cf1992a7b27 --- /dev/null +++ b/application/neural_search/recall/milvus/embedding_recall.py @@ -0,0 +1,40 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import numpy as np +from milvus_insert import VecToMilvus +import random +from tqdm import tqdm +from milvus_recall import RecallByMilvus +import time + +embeddings=np.load('corpus_embedding.npy') +print(embeddings.shape) + +embedding_ids = [i for i in range(embeddings.shape[0])] +print(len(embedding_ids)) +client = VecToMilvus() +collection_name = 'literature_search' +partition_tag = 'partition_2' +data_size=len(embedding_ids) +client = RecallByMilvus() +embeddings = embeddings[np.arange(1,2)] +time_start = time.time() #开始计时 +status, resultes = client.search(collection_name=collection_name, vectors=embeddings, partition_tag=partition_tag) +time_end = time.time() #结束计时 + +sum_t=time_end - time_start #运行所花时间 +print('time cost', sum_t, 's') +print(status) +print(resultes) \ No newline at end of file diff --git a/application/neural_search/recall/milvus/feature_extract.py b/application/neural_search/recall/milvus/feature_extract.py new file mode 100644 index 000000000000..28a75d298e85 --- /dev/null +++ b/application/neural_search/recall/milvus/feature_extract.py @@ -0,0 +1,113 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from functools import partial +import argparse +import os +import sys +import random +import time + +import numpy as np +import paddle +import paddle.nn.functional as F +import paddlenlp as ppnlp +from paddlenlp.data import Stack, Tuple, Pad +from paddlenlp.datasets import load_dataset, MapDataset, load_dataset +from paddlenlp.utils.log import logger + +from base_model import SemanticIndexBase +from data import convert_example, create_dataloader +from data import gen_id2corpus, gen_text_file +from tqdm import tqdm + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--corpus_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--similar_text_pair_file", type=str, required=True, help="The full path of similar text pair file") +parser.add_argument("--recall_result_dir", type=str, default='recall_result', help="The full path of recall result file to save") +parser.add_argument("--recall_result_file", type=str, default='recall_result_file', help="The file name of recall result") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. 
" + "Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument("--recall_num", default=10, type=int, help="Recall number for each query from Ann index.") + +parser.add_argument("--hnsw_m", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_ef", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_max_elements", default=1000000, type=int, help="Recall number for each query from Ann index.") + +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() + + +if __name__ == "__main__": + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained('ernie-1.0') + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # text_segment + ): [data for data in fn(samples)] + + pretrained_model = ppnlp.transformers.ErnieModel.from_pretrained( + "ernie-1.0") + + model = SemanticIndexBase( + pretrained_model, output_emb_size=args.output_emb_size) + model = paddle.DataParallel(model) + + # Load pretrained semantic model + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + logger.info("Loaded parameters from %s" % args.params_path) + else: + raise ValueError( + "Please set --params_path with correct pretrained model file") + + id2corpus = gen_id2corpus(args.corpus_file) + + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + + corpus_data_loader = create_dataloader( + corpus_ds, + mode='predict', + batch_size=args.batch_size, + batchify_fn=batchify_fn, + trans_fn=trans_func) + + # Need better way to get inner model of DataParallel + inner_model = model._layers + + + all_embeddings = [] + + for text_embeddings in tqdm(inner_model.get_semantic_embedding(corpus_data_loader)): + all_embeddings.append(text_embeddings.numpy()) + + all_embeddings = np.concatenate(all_embeddings, axis=0) + # print(all_embeddings.shape) + np.save('corpus_embedding',all_embeddings) \ No newline at end of file diff --git a/application/neural_search/recall/milvus/inference.py b/application/neural_search/recall/milvus/inference.py new file mode 100644 index 000000000000..77f97421ce1e --- /dev/null +++ b/application/neural_search/recall/milvus/inference.py @@ -0,0 +1,110 @@ +from functools import partial +import argparse +import os +import sys +import random +import time + +import numpy as np +import paddle +import paddle.nn.functional as F +import paddlenlp as ppnlp +from paddlenlp.data import Stack, Tuple, Pad +from paddlenlp.datasets import load_dataset, MapDataset, load_dataset +from paddlenlp.utils.log import logger + +from base_model import SemanticIndexBaseStatic +from data import convert_example, create_dataloader +from data import gen_id2corpus, 
gen_text_file +from tqdm import tqdm +from milvus_recall import RecallByMilvus + + +def search_in_milvus(text_embedding): + collection_name = 'literature_search' + partition_tag = 'partition_2' + client = RecallByMilvus() + status, results = client.search(collection_name=collection_name, vectors=text_embedding.tolist(), partition_tag=partition_tag) + # print(status) + # print(resultes) + corpus_file="milvus/milvus_data.csv" + id2corpus = gen_id2corpus(corpus_file) + # print(status) + # print(results) + for line in results: + for item in line: + idx=item.id + distance=item.distance + text=id2corpus[idx] + print(idx,text,distance) + + + +if __name__ == "__main__": + device= 'gpu' + max_seq_length=64 + output_emb_size=256 + batch_size=1 + params_path='checkpoints/model_40/model_state.pdparams' + id2corpus={0:'国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据'} + paddle.set_device(device) + + tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained('ernie-1.0') + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # text_segment + ): [data for data in fn(samples)] + + pretrained_model = ppnlp.transformers.ErnieModel.from_pretrained( + "ernie-1.0") + + model = SemanticIndexBaseStatic( + pretrained_model, output_emb_size=output_emb_size) + + # Load pretrained semantic model + if params_path and os.path.isfile(params_path): + state_dict = paddle.load(params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % params_path) + else: + raise ValueError( + "Please set --params_path with correct pretrained model file") + + + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + + corpus_data_loader = create_dataloader( + corpus_ds, + mode='predict', + batch_size=batch_size, + batchify_fn=batchify_fn, + trans_fn=trans_func) + + # Need better way to get inner model of DataParallel + + all_embeddings = [] + + with paddle.no_grad(): + for batch_data in corpus_data_loader: + input_ids, token_type_ids = batch_data + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + + text_embeddings = model( + input_ids, token_type_ids) + all_embeddings.append(text_embeddings) + + + text_embedding=all_embeddings[0] + print(text_embedding.shape) + print(text_embedding) + search_in_milvus(text_embedding) + + diff --git a/application/neural_search/recall/milvus/milvus_insert.py b/application/neural_search/recall/milvus/milvus_insert.py new file mode 100644 index 000000000000..0f7b01019ea0 --- /dev/null +++ b/application/neural_search/recall/milvus/milvus_insert.py @@ -0,0 +1,90 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
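+# VecToMilvus (below) wraps the Milvus 1.x Python client: insert() lazily creates the
+# collection, IVF_FLAT index and partition configured in config.py, then writes embedding
+# batches and flushes, so callers only need to pass vectors, ids and a partition tag.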
+ + +from milvus import * +from config import MILVUS_HOST, MILVUS_PORT, collection_param, index_type, index_param + + +class VecToMilvus(): + def __init__(self): + self.client = Milvus(host=MILVUS_HOST, port=MILVUS_PORT) + + def has_collection(self, collection_name): + try: + status, ok = self.client.has_collection(collection_name) + return ok + except Exception as e: + print("Milvus has_table error:", e) + + def creat_collection(self, collection_name): + try: + collection_param['collection_name'] = collection_name + status = self.client.create_collection(collection_param) + print(status) + return status + except Exception as e: + print("Milvus create collection error:", e) + + def create_index(self, collection_name): + try: + status = self.client.create_index(collection_name, index_type, index_param) + print(status) + return status + except Exception as e: + print("Milvus create index error:", e) + + def has_partition(self, collection_name, partition_tag): + try: + status, ok = self.client.has_partition(collection_name, partition_tag) + return ok + except Exception as e: + print("Milvus has partition error: ", e) + + def create_partition(self, collection_name, partition_tag): + try: + status = self.client.create_partition(collection_name, partition_tag) + print('create partition {} successfully'.format(partition_tag)) + return status + except Exception as e: + print('Milvus create partition error: ', e) + + def insert(self, vectors, collection_name, ids=None, partition_tag=None): + try: + if not self.has_collection(collection_name): + self.creat_collection(collection_name) + self.create_index(collection_name) + print('collection info: {}'.format(self.client.get_collection_info(collection_name)[1])) + if (partition_tag is not None) and (not self.has_partition(collection_name, partition_tag)): + self.create_partition(collection_name, partition_tag) + status, ids = self.client.insert(collection_name=collection_name, records=vectors, ids=ids, + partition_tag=partition_tag) + self.client.flush([collection_name]) + print('Insert {} entities, there are {} entities after insert data.'.format(len(ids), self.client.count_entities(collection_name)[1])) + return status, ids + except Exception as e: + print("Milvus insert error:", e) + + +if __name__ == '__main__': + import random + + client = VecToMilvus() + collection_name = 'test1' + partition_tag = 'partition_1' + ids = [random.randint(0, 1000) for _ in range(100)] + embeddings = [[random.random() for _ in range(128)] for _ in range(100)] + status, ids = client.insert(collection_name=collection_name, vectors=embeddings, ids=ids,partition_tag=partition_tag) + print(status) + print(ids) diff --git a/application/neural_search/recall/milvus/milvus_recall.py b/application/neural_search/recall/milvus/milvus_recall.py new file mode 100644 index 000000000000..e28c7aee390e --- /dev/null +++ b/application/neural_search/recall/milvus/milvus_recall.py @@ -0,0 +1,41 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +from milvus import * +from config import MILVUS_HOST, MILVUS_PORT, top_k, search_param + + +class RecallByMilvus(): + def __init__(self): + self.client = Milvus(host=MILVUS_HOST, port=MILVUS_PORT) + + def search(self, vectors, collection_name, partition_tag=None): + try: + status, results = self.client.search(collection_name=collection_name, query_records=vectors, top_k=top_k, + params=search_param, partition_tag=partition_tag) + # print(status) + return status, results + except Exception as e: + print('Milvus recall error: ', e) + + +if __name__ == '__main__': + import random + client = RecallByMilvus() + collection_name = 'test1' + partition_tag = 'partition_3' + embeddings = [[random.random() for _ in range(128)] for _ in range(2)] + status, resultes = client.search(collection_name=collection_name, vectors=embeddings, partition_tag=partition_tag) + print(status) + print(resultes) diff --git a/application/neural_search/recall/milvus/scripts/feature_extract.sh b/application/neural_search/recall/milvus/scripts/feature_extract.sh new file mode 100644 index 000000000000..4a296edf41c9 --- /dev/null +++ b/application/neural_search/recall/milvus/scripts/feature_extract.sh @@ -0,0 +1,15 @@ +root_dir="checkpoints" +python -u -m paddle.distributed.launch --gpus "3" --log_dir "recall_log/" \ + feature_extract.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --params_path "${root_dir}/model_40/model_state.pdparams" \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 4096 \ + --output_emb_size 256\ + --max_seq_length 60 \ + --recall_num 50 \ + --similar_text_pair "recall/test.csv" \ + --corpus_file "milvus/milvus_data.csv" \ No newline at end of file diff --git a/application/neural_search/recall/milvus/server_config.yml b/application/neural_search/recall/milvus/server_config.yml new file mode 100644 index 000000000000..462f54319f42 --- /dev/null +++ b/application/neural_search/recall/milvus/server_config.yml @@ -0,0 +1,236 @@ +# Copyright (C) 2019-2020 Zilliz. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software distributed under the License +# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express +# or implied. See the License for the specific language governing permissions and limitations under the License. + +version: 0.5 + +#----------------------+------------------------------------------------------------+------------+-----------------+ +# Cluster Config | Description | Type | Default | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# enable | If running with Mishards, set true, otherwise false. 
| Boolean | false | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# role | Milvus deployment role: rw / ro | Role | rw | +#----------------------+------------------------------------------------------------+------------+-----------------+ +cluster: + enable: false + role: rw + +#----------------------+------------------------------------------------------------+------------+-----------------+ +# General Config | Description | Type | Default | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# timezone | Use UTC-x or UTC+x to specify a time zone. | Timezone | UTC+8 | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# meta_uri | URI for metadata storage, using SQLite (for single server | URI | sqlite://:@:/ | +# | Milvus) or MySQL (for distributed cluster Milvus). | | | +# | Format: dialect://username:password@host:port/database | | | +# | Keep 'dialect://:@:/', 'dialect' can be either 'sqlite' or | | | +# | 'mysql', replace other texts with real values. | | | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# meta_ssl_ca | The path of the Certificate Authority (CA) certificate | String | | +# | file in PEM format. | | | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# meta_ssl_key | The path of the client SSL private key file in PEM format. | String | | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# meta_ssl_cert | The path of the client SSL public key certificate file in | String | | +# | PEM format. | | | +#----------------------+------------------------------------------------------------+------------+-----------------+ +general: + timezone: UTC+8 + meta_uri: sqlite://:@:/ + meta_ssl_ca: + meta_ssl_key: + meta_ssl_cert: + +#----------------------+------------------------------------------------------------+------------+-----------------+ +# Network Config | Description | Type | Default | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# bind.address | IP address that Milvus server monitors. | IP | 0.0.0.0 | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# bind.port | Port that Milvus server monitors. Port range (1024, 65535) | Integer | 19530 | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# http.enable | Enable HTTP server or not. | Boolean | true | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# http.port | Port that Milvus HTTP server monitors. 
| Integer | 19121 | +# | Port range (1024, 65535) | | | +#----------------------+------------------------------------------------------------+------------+-----------------+ +network: + bind.address: 0.0.0.0 + bind.port: 8530 + http.enable: true + http.port: 8121 + +#----------------------+------------------------------------------------------------+------------+-----------------+ +# Storage Config | Description | Type | Default | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# path | Path used to save meta data, vector data and index data. | Path | /var/lib/milvus | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# auto_flush_interval | The interval, in seconds, at which Milvus automatically | Integer | 1 (s) | +# | flushes data to disk. | | | +# | 0 means disable the regular flush. | | | +#----------------------+------------------------------------------------------------+------------+-----------------+ +storage: + path: /tmp/milvus + auto_flush_interval: 1 + +#----------------------+------------------------------------------------------------+------------+-----------------+ +# WAL Config | Description | Type | Default | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# enable | Whether to enable write-ahead logging (WAL) in Milvus. | Boolean | true | +# | If WAL is enabled, Milvus writes all data changes to log | | | +# | files in advance before implementing data changes. WAL | | | +# | ensures the atomicity and durability for Milvus operations.| | | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# recovery_error_ignore| Whether to ignore logs with errors that happens during WAL | Boolean | false | +# | recovery. If true, when Milvus restarts for recovery and | | | +# | there are errors in WAL log files, log files with errors | | | +# | are ignored. If false, Milvus does not restart when there | | | +# | are errors in WAL log files. | | | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# buffer_size | Sum total of the read buffer and the write buffer in Bytes.| String | 256MB | +# | buffer_size must be in range [64MB, 4096MB]. | | | +# | If the value you specified is out of range, Milvus | | | +# | automatically uses the boundary value closest to the | | | +# | specified value. It is recommended you set buffer_size to | | | +# | a value greater than the inserted data size of a single | | | +# | insert operation for better performance. | | | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# path | Location of WAL log files. | String | | +#----------------------+------------------------------------------------------------+------------+-----------------+ +wal: + enable: true + recovery_error_ignore: false + buffer_size: 256MB + path: /tmp/milvus/wal + +#----------------------+------------------------------------------------------------+------------+-----------------+ +# Cache Config | Description | Type | Default | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# cache_size | The size of CPU memory used for caching data for faster | String | 4GB | +# | query. 
The sum of 'cache_size' and 'insert_buffer_size' | | | +# | must be less than system memory size. | | | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# insert_buffer_size | Buffer size used for data insertion. | String | 1GB | +# | The sum of 'insert_buffer_size' and 'cache_size' | | | +# | must be less than system memory size. | | | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# preload_collection | A comma-separated list of collection names that need to | StringList | | +# | be pre-loaded when Milvus server starts up. | | | +# | '*' means preload all existing tables (single-quote or | | | +# | double-quote required). | | | +#----------------------+------------------------------------------------------------+------------+-----------------+ +cache: + cache_size: 32GB + insert_buffer_size: 8GB + preload_collection: + +#----------------------+------------------------------------------------------------+------------+-----------------+ +# GPU Config | Description | Type | Default | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# enable | Use GPU devices or not. | Boolean | false | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# cache.enable | Enable cache index on GPU devices or not. | Boolean | false | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# cache_size | The size of GPU memory per card used for cache. | String | 1GB | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# gpu_search_threshold | A Milvus performance tuning parameter. This value will be | Integer | 1000 | +# | compared with 'nq' to decide if the search computation will| | | +# | be executed on GPUs only. | | | +# | If nq >= gpu_search_threshold, the search computation will | | | +# | be executed on GPUs only; | | | +# | if nq < gpu_search_threshold, the search computation will | | | +# | be executed on CPUs only. | | | +# | The SQ8H index is special, if nq < gpu_search_threshold, | | | +# | the search will be executed on both CPUs and GPUs. | | | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# search_devices | The list of GPU devices used for search computation. | DeviceList | gpu0 | +# | Must be in format gpux. | | | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# build_index_devices | The list of GPU devices used for index building. | DeviceList | gpu0 | +# | Must be in format gpux. | | | +#----------------------+------------------------------------------------------------+------------+-----------------+ +gpu: + enable: false + cache.enable: false + cache_size: 1GB + gpu_search_threshold: 1000 + search_devices: + - gpu0 + build_index_devices: + - gpu0 + +#----------------------+------------------------------------------------------------+------------+-----------------+ +# FPGA Config | Description | Type | Default | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# enable | Use FPGA devices or not. 
| Boolean | false | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# search_devices | The list of FPGA devices used for search computation. | DeviceList | fpga0 | +# | Must be in format fpgax. | | | +#----------------------+------------------------------------------------------------+------------+-----------------+ +fpga: + enable: false + search_devices: + - fpga0 + +#----------------------+------------------------------------------------------------+------------+-----------------+ +# APU Config | Description | Type | Default | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# enable | Use APU devices or not. | Boolean | false | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# apu_devices | The number of APU devices exist for computation. | DeviceList | 1 | +#----------------------+------------------------------------------------------------+------------+-----------------+ +apu: + enable: false + search_devices: 1 + + + +#----------------------+------------------------------------------------------------+------------+-----------------+ +# Logs Config | Description | Type | Default | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# level | Log level in Milvus. Must be one of debug, info, warning, | String | debug | +# | error, fatal | | | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# trace.enable | Whether to enable trace level logging in Milvus. | Boolean | true | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# path | Absolute path to the folder holding the log files. | String | | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# max_log_file_size | The maximum size of each log file, size range | String | 1024MB | +# | [512MB, 4096MB]. | | | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# log_rotate_num | The maximum number of log files that Milvus keeps for each | Integer | 0 | +# | logging level, num range [0, 1024], 0 means unlimited. | | | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# log_to_stdout | Whether to write logs to standard output in Milvus. | Boolean | false | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# log_to_file | Whether to write logs to files in Milvus | Boolean | true | +#----------------------+------------------------------------------------------------+------------+-----------------+ +logs: + level: debug + trace.enable: true + path: /tmp/milvus/logs + max_log_file_size: 1024MB + log_rotate_num: 0 + log_to_stdout: false + log_to_file: true + +#----------------------+------------------------------------------------------------+------------+-----------------+ +# Metric Config | Description | Type | Default | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# enable | Enable monitoring function or not. 
| Boolean | false | +#----------------------+------------------------------------------------------------+------------+-----------------+ +# address | Pushgateway address | IP | 127.0.0.1 + +#----------------------+------------------------------------------------------------+------------+-----------------+ +# port | Pushgateway port, port range (1024, 65535) | Integer | 9091 | +#----------------------+------------------------------------------------------------+------------+-----------------+ +metric: + enable: false + address: 127.0.0.1 + port: 9091 + diff --git a/application/neural_search/recall/simcse/README.md b/application/neural_search/recall/simcse/README.md new file mode 100644 index 000000000000..1fc8584b4c8e --- /dev/null +++ b/application/neural_search/recall/simcse/README.md @@ -0,0 +1,429 @@ + + **目录** + +* [背景介绍](#背景介绍) +* [SimCSE](#SimCSE) + * [1. 技术方案和评估指标](#技术方案) + * [2. 环境依赖](#环境依赖) + * [3. 代码结构](#代码结构) + * [4. 数据准备](#数据准备) + * [5. 模型训练](#模型训练) + * [6. 评估](#开始评估) + * [7. 预测](#预测) + * [8. 部署](#部署) + + + +# 背景介绍 + +语义索引(可通俗理解为向量索引)技术是搜索引擎、推荐系统、广告系统在召回阶段的核心技术之一。语义索引模型的目标是:给定输入文本,模型可以从海量候选召回库中**快速、准确**地召回一批语义相关文本。语义索引模型的效果直接决定了语义相关的物料能否被成功召回进入系统参与上层排序,从基础层面影响整个系统的效果。 + +在召回阶段,最常见的方式是通过双塔模型,学习Document(简写为Doc)的向量表示,对Doc端建立索引,用ANN召回。我们在这种方式的基础上,引入无监督预训练策略,以如下训练数据为例: + + +``` +我手机丢了,我想换个手机 我想买个新手机,求推荐 +求秋色之空漫画全集 求秋色之空全集漫画 +学日语软件手机上的 手机学日语的软件 +侠盗飞车罪恶都市怎样改车 侠盗飞车罪恶都市怎么改车 +``` + +SimCSE 模型适合缺乏监督数据,但是又有大量无监督数据的匹配和检索场景。 + + + + +# SimCSE + + + +## 1. 技术方案和评估指标 + +### 技术方案 + +双塔模型,采用ERNIE1.0热启,在召回阶段引入 SimCSE 策略。 + + +### 评估指标 + +(1)采用 Recall@1,Recall@5 ,Recall@10 ,Recall@20 和 Recall@50 指标来评估语义索引模型的召回效果。 + +**效果评估** + +| 模型 | Recall@1 | Recall@5 |Recall@10 |Recall@20 |Recall@50 |策略简要说明| +| ------------ | ------------ | ------------ |--------- |--------- |--------- |--------- | +| SimCSE | 42.374 | 57.505| 62.641| 67.09|72.331| SimCSE无监督训练| + + + + +## 2. 环境依赖和安装说明 + +**环境依赖** +* python >= 3.6 +* paddlepaddle >= 2.1.3 +* paddlenlp >= 2.2 +* [hnswlib](https://github.com/nmslib/hnswlib) >= 0.5.2 +* visualdl >= 2.2.2 + + + + + +## 3. 代码结构 + +以下是本项目主要代码结构及说明: + +``` +simcse/ +├── model.py # SimCSE 模型组网代码 +|—— deploy + |—— python + |—— predict.py # PaddleInference + ├── deploy.sh # Paddle Inference的bash脚本 +|—— scripts + ├── export_model.sh # 动态图转静态图bash脚本 + ├── predict.sh # 预测的bash脚本 + ├── evaluate.sh # 召回评估bash脚本 + ├── run_build_index.sh # 索引的构建脚本 + ├── train.sh # 训练的bash脚本 +|—— ann_util.py # Ann 建索引库相关函数 +├── data.py # 无监督语义匹配训练数据、测试数据的读取逻辑 +├── export_model.py # 动态图转静态图 +├── predict.py # 基于训练好的无监督语义匹配模型计算文本 Pair 相似度 +├── evaluate.py # 根据召回结果和评估集计算评估指标 +|—— inference.py # 动态图抽取向量 +|—— recall.py # 基于训练好的语义索引模型,从召回库中召回给定文本的相似文本 +└── train.py # SimCSE 模型训练、评估逻辑 + +``` + + + +## 4. 数据准备 + +### 数据集说明 + +我们基于开源的语义匹配数据集构造生成了面向语义索引的训练集、评估集、召回库。 + +样例数据如下: +``` +睡眠障碍与常见神经系统疾病的关系睡眠觉醒障碍,神经系统疾病,睡眠,快速眼运动,细胞增殖,阿尔茨海默病 +城市道路交通流中观仿真研究 +城市道路交通流中观仿真研究智能运输系统;城市交通管理;计算机仿真;城市道路;交通流;路径选择 +网络健康可信性研究 +网络健康可信性研究网络健康信息;可信性;评估模式 +脑瘫患儿家庭复原力的影响因素及干预模式雏形 研究 +脑瘫患儿家庭复原力的影响因素及干预模式雏形研究脑瘫患儿;家庭功能;干预模式 +地西他滨与HA方案治疗骨髓增生异常综合征转化的急性髓系白血病患者近期疗效比较 +地西他滨与HA方案治疗骨髓增生异常综合征转化的急性髓系白血病患者近期疗效比较 +个案工作 社会化 +个案社会工作介入社区矫正再社会化研究——以东莞市清溪镇为例社会工作者;社区矫正人员;再社会化;角色定位 +圆周运动加速度角速度 +圆周运动向心加速度物理意义的理论分析匀速圆周运动,向心加速度,物理意义,角速度,物理量,线速度,周期 +``` + +召回集,验证集,测试集与inbatch-negative实验的数据保持一致 + + +### 数据集下载 + + +- [literature_search_data](https://bj.bcebos.com/v1/paddlenlp/data/literature_search_data.zip) + +``` +├── milvus # milvus建库数据集 + ├── milvus_data.csv. 
# 构建召回库的数据 +├── recall # 召回(语义索引)数据集 + ├── corpus.csv # 用于测试的召回库 + ├── dev.csv # 召回验证集 + ├── test.csv # 召回测试集 + ├── train.csv # 召回训练集 + ├── train_unsupervised.csv # 无监督训练集 +├── sort # 排序数据集 + ├── test_pairwise.csv # 排序测试集 + ├── dev_pairwise.csv # 排序验证集 + └── train_pairwise.csv # 排序训练集 + +``` + + + +## 5. 模型训练 + +**语义索引预训练模型下载链接:** + +以下模型结构参数为: `TrasformerLayer:12, Hidden:768, Heads:12, OutputEmbSize: 256` + +|Model|训练参数配置|硬件|MD5| +| ------------ | ------------ | ------------ |-----------| +|[SimCSE](https://bj.bcebos.com/v1/paddlenlp/models/simcse_model.zip)|
epoch:3 lr:5E-5 bs:64 max_len:64 | 4卡 v100-16g
|7c46d9b15a214292e3897c0eb70d0c9f| + +### 训练环境说明 + ++ NVIDIA Driver Version: 440.64.00 ++ Ubuntu 16.04.6 LTS (Docker) ++ Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz + + +### 单机单卡训练/单机多卡训练 + +这里采用单机多卡方式进行训练,通过如下命令,指定 GPU 0,1,2,3 卡, 基于SimCSE训练模型,无监督的数据量比较大,4卡的训练的时长在16个小时左右。如果采用单机单卡训练,只需要把`--gpu`参数设置成单卡的卡号即可。 + +训练的命令如下: + +```shell +$ unset CUDA_VISIBLE_DEVICES +python -u -m paddle.distributed.launch --gpus '0,1,2,3' \ + train.py \ + --device gpu \ + --save_dir ./checkpoints/ \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 3 \ + --save_steps 2000 \ + --eval_steps 100 \ + --max_seq_length 64 \ + --infer_with_fc_pooler \ + --dropout 0.2 \ + --output_emb_size 256 \ + --train_set_file "./recall/train_unsupervised.csv" \ + --test_set_file "./recall/dev.csv" +``` +也可以使用bash脚本: + +``` +sh scripts/train.sh +``` + + + +可支持配置的参数: + +* `infer_with_fc_pooler`:可选,在预测阶段计算文本 embedding 表示的时候网络前向是否会过训练阶段最后一层的 fc; 建议打开模型效果最好。 +* `scale`:可选,在计算 cross_entropy loss 之前对 cosine 相似度进行缩放的因子;默认为 20。 +* `dropout`:可选,SimCSE 网络前向使用的 dropout 取值;默认 0.1。 +* `save_dir`:可选,保存训练模型的目录;默认保存在当前目录checkpoints文件夹下。 +* `max_seq_length`:可选,ERNIE-Gram 模型使用的最大序列长度,最大不能超过512, 若出现显存不足,请适当调低这一参数;默认为128。 +* `batch_size`:可选,批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为32。 +* `learning_rate`:可选,Fine-tune的最大学习率;默认为5e-5。 +* `weight_decay`:可选,控制正则项力度的参数,用于防止过拟合,默认为0.0。 +* `epochs`: 训练轮次,默认为1。 +* `warmup_proption`:可选,学习率warmup策略的比例,如果0.1,则学习率会在前10%训练step的过程中从0慢慢增长到learning_rate, 而后再缓慢衰减,默认为0.0。 +* `init_from_ckpt`:可选,模型参数路径,热启动模型训练;默认为None。 +* `seed`:可选,随机种子,默认为1000. +* `device`: 选用什么设备进行训练,可选cpu或gpu。如使用gpu训练则参数gpus指定GPU卡号。 + +程序运行时将会自动进行训练,评估。同时训练过程中会自动保存模型在指定的`save_dir`中。 +如: +```text +checkpoints/ +├── model_100 +│   ├── model_state.pdparams +│   ├── tokenizer_config.json +│   └── vocab.txt +└── ... +``` + + + +## 6. 评估 + +效果评估分为 4 个步骤: + +a. 获取Doc端Embedding + +基于语义索引模型抽取出Doc样本库的文本向量, + +b. 采用hnswlib对Doc端Embedding建库 + +使用 ANN 引擎构建索引库(这里基于 [hnswlib](https://github.com/nmslib/hnswlib) 进行 ANN 索引) + +c. 获取Query的Embedding并查询相似结果 + +基于语义索引模型抽取出评估集 *Source Text* 的文本向量,在第 2 步中建立的索引库中进行 ANN 查询,召回 Top50 最相似的 *Target Text*, 产出评估集中 *Source Text* 的召回结果 `recall_result` 文件 + +d. 评估 + +基于评估集 `same_semantic.tsv` 和召回结果 `recall_result` 计算评估指标 Recall@k,其中k取值1,5,10,20,50. + +运行如下命令进行 ANN 建库、召回,产出召回结果数据 `recall_result` + +``` +python -u -m paddle.distributed.launch --gpus "6" --log_dir "recall_log/" \ + recall.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --params_path "checkpoints/model_20000/model_state.pdparams" \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 64 \ + --output_emb_size 256\ + --max_seq_length 60 \ + --recall_num 50 \ + --similar_text_pair "recall/dev.csv" \ + --corpus_file "recall/corpus.csv" +``` +也可以使用下面的bash脚本: + +``` +sh scripts/run_build_index.sh +``` + +run_build_index.sh还包含cpu和gpu运行的脚本,默认是gpu的脚本 + + +接下来,运行如下命令进行效果评估,产出Recall@1, Recall@5, Recall@10, Recall@20 和 Recall@50 指标: +``` +python -u evaluate.py \ + --similar_text_pair "recall/dev.csv" \ + --recall_result_file "./recall_result_dir/recall_result.txt" \ + --recall_num 50 +``` +也可以使用下面的bash脚本: + +``` +bash scripts/evaluate.sh +``` + +参数含义说明 +* `similar_text_pair`: 由相似文本对构成的评估集 semantic_similar_pair.tsv +* `recall_result_file`: 针对评估集中第一列文本 *Source Text* 的召回结果 +* `recall_num`: 对 1 个文本召回的相似文本数量 + +成功运行结束后,会输出如下评估指标: + +``` +recall@1=45.183 +recall@5=60.444 +recall@10=65.224 +recall@20=69.562 +recall@50=74.848 +``` + + + + +## 7. 
预测 + +我们可以基于语义索引模型预测文本的语义向量或者计算文本 Pair 的语义相似度。 + +### 7.1 功能一:抽取文本的语义向量 + +修改 inference.py 文件里面输入文本 id2corpus 和模型路径 params_path: + +``` +params_path='checkpoints/model_20000/model_state.pdparams' +id2corpus={0:'国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据'} +``` +然后运行 +``` +python inference.py +``` +预测结果位256维的向量: + +``` +[1, 256] +[[-6.70653954e-02 -6.46878220e-03 -6.78317016e-03 1.66617986e-02 + 7.20006675e-02 -9.79134627e-03 -1.38441555e-03 4.37440760e-02 + 4.78116237e-02 1.33881181e-01 1.82927232e-02 3.23656350e-02 + ... +``` + +### 7.2 功能二:计算文本 Pair 的语义相似度 + +### 准备预测数据 + +待预测数据为 tab 分隔的 tsv 文件,每一行为 1 个文本 Pair,部分示例如下: +``` +热处理对尼龙6 及其与聚酰胺嵌段共聚物共混体系晶体熔融行为和结晶结构的影响 热处理对尼龙6及其与聚酰胺嵌段共聚物共混体系晶体熔融行为和结晶结构的影响尼龙6,聚酰胺嵌段共聚物,芳香聚酰胺,热处理 +面向生态系统服务的生态系统分类方案研发与应用. 面向生态系统服务的生态系统分类方案研发与应用 +huntington舞蹈病的动物模型 Huntington舞蹈病的动物模型 +试论我国海岸带经济开发的问题与前景 试论我国海岸带经济开发的问题与前景海岸带,经济开发,问题,前景 +``` + +### 开始预测 + +以上述 demo 数据为例,运行如下命令基于我们开源的 SimCSE无监督语义索引模型开始计算文本 Pair 的语义相似度: +``` +root_dir="checkpoints" + +python -u -m paddle.distributed.launch --gpus "3" \ + predict.py \ + --device gpu \ + --params_path "${root_dir}/model_20000/model_state.pdparams" \ + --output_emb_size 256 \ + --batch_size 128 \ + --max_seq_length 64 \ + --text_pair_file "recall/test.csv" +``` + +参数含义说明 +* `device`: 使用 cpu/gpu 进行训练 +* `params_path`: 预训练模型的参数文件名 +* `output_emb_size`: Transformer 顶层输出的文本向量维度 +* `text_pair_file`: 由文本 Pair 构成的待预测数据集 + +也可以运行下面的bash脚本: + +``` +sh scripts/predict.sh +``` + +产出如下结果 +``` +0.6477588415145874 +0.9698382019996643 +1.0 +0.1787596344947815 +``` + + + +## 8. 部署 + +### 动转静导出 + +首先把动态图模型转换为静态图: + +``` +python export_model.py --params_path checkpoints/model_20000/model_state.pdparams --output_path=./output +``` +也可以运行下面的bash脚本: + +``` +sh scripts/export_model.sh +``` + +### Paddle Inference预测 + +修改id2corpus的样本: + +``` +id2corpus={0:'国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据'} + +``` +然后使用PaddleInference + +``` +python deploy/python/predict.py --model_dir=./output +``` +也可以运行下面的bash脚本: + +``` +sh deploy.sh +``` +最终输出的是256维度的特征向量 + +``` +(1, 256) +[[-6.70653731e-02 -6.46873191e-03 -6.78317575e-03 1.66618153e-02 + 7.20006898e-02 -9.79136024e-03 -1.38439541e-03 4.37440872e-02 + 4.78115827e-02 1.33881137e-01 1.82927139e-02 3.23656537e-02 + ....... +``` + + +## Reference +[1] Gao, Tianyu, Xingcheng Yao, and Danqi Chen. “SimCSE: Simple Contrastive Learning of Sentence Embeddings.” ArXiv:2104.08821 [Cs], April 18, 2021. http://arxiv.org/abs/2104.08821. diff --git a/application/neural_search/recall/simcse/ann_util.py b/application/neural_search/recall/simcse/ann_util.py new file mode 100644 index 000000000000..707e58e752d7 --- /dev/null +++ b/application/neural_search/recall/simcse/ann_util.py @@ -0,0 +1,60 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
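+
+# ann_util.py: helper for building the ANN index used in the recall stage.
+# build_index() streams batches through model.get_semantic_embedding(), gathers the
+# resulting vectors and adds them to an hnswlib index created in inner-product
+# ('ip') space; because the SimCSE model L2-normalizes its output embeddings,
+# inner product is equivalent to cosine similarity here.
+# The index capacity and search parameters (hnsw_max_elements, hnsw_ef, hnsw_m)
+# are taken from the command-line arguments of the caller (recall.py).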
+ +# coding=UTF-8 + +import numpy as np +import hnswlib +from paddlenlp.utils.log import logger + + +def build_index(args, data_loader, model): + + index = hnswlib.Index(space='ip', dim=args.output_emb_size) + + # Initializing index + # max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded + # during insertion of an element. + # The capacity can be increased by saving/loading the index, see below. + # + # ef_construction - controls index search speed/build speed tradeoff + # + # M - is tightly connected with internal dimensionality of the data. Strongly affects memory consumption (~M) + # Higher M leads to higher accuracy/run_time at fixed ef/efConstruction + index.init_index( + max_elements=args.hnsw_max_elements, + ef_construction=args.hnsw_ef, + M=args.hnsw_m) + + # Controlling the recall by setting ef: + # higher ef leads to better accuracy, but slower search + index.set_ef(args.hnsw_ef) + + # Set number of threads used during batch search/construction + # By default using all available cores + index.set_num_threads(16) + + logger.info("start build index..........") + + all_embeddings = [] + + for text_embeddings in model.get_semantic_embedding(data_loader): + all_embeddings.append(text_embeddings.numpy()) + + all_embeddings = np.concatenate(all_embeddings, axis=0) + index.add_items(all_embeddings) + + logger.info("Total index number:{}".format(index.get_current_count())) + + return index diff --git a/application/neural_search/recall/simcse/data.py b/application/neural_search/recall/simcse/data.py new file mode 100644 index 000000000000..5ebb2447ddb2 --- /dev/null +++ b/application/neural_search/recall/simcse/data.py @@ -0,0 +1,163 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os + +import paddle + +from paddlenlp.utils.log import logger + + +def create_dataloader(dataset, + mode='train', + batch_size=1, + batchify_fn=None, + trans_fn=None): + if trans_fn: + dataset = dataset.map(trans_fn) + + shuffle = True if mode == 'train' else False + if mode == 'train': + batch_sampler = paddle.io.DistributedBatchSampler( + dataset, batch_size=batch_size, shuffle=shuffle) + else: + batch_sampler = paddle.io.BatchSampler( + dataset, batch_size=batch_size, shuffle=shuffle) + + return paddle.io.DataLoader( + dataset=dataset, + batch_sampler=batch_sampler, + collate_fn=batchify_fn, + return_list=True) + + +def convert_example_test(example, + tokenizer, + max_seq_length=512, + pad_to_max_seq_len=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. 
+ max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + + result = [] + for key, text in example.items(): + encoded_inputs = tokenizer( + text=text, + max_seq_len=max_seq_length, + pad_to_max_seq_len=pad_to_max_seq_len) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + return result + + +def convert_example(example, tokenizer, max_seq_length=512, do_evalute=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + + result = [] + + for key, text in example.items(): + if 'label' in key: + # do_evaluate + result += [example['label']] + else: + # do_train + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + + return result + +def gen_id2corpus(corpus_file): + id2corpus = {} + with open(corpus_file, 'r', encoding='utf-8') as f: + for idx, line in enumerate(f): + id2corpus[idx] = line.rstrip() + return id2corpus + +def gen_text_file(similar_text_pair_file): + text2similar_text = {} + texts = [] + with open(similar_text_pair_file, 'r', encoding='utf-8') as f: + for line in f: + splited_line = line.rstrip().split("\t") + if len(splited_line) != 2: + continue + + text, similar_text = line.rstrip().split("\t") + + if not text or not similar_text: + continue + + text2similar_text[text] = similar_text + texts.append({"text": text}) + return texts, text2similar_text + + +def read_simcse_text(data_path): + """Reads data.""" + with open(data_path, 'r', encoding='utf-8') as f: + for line in f: + data = line.rstrip() + yield {'text_a': data, 'text_b': data} + + +def read_text_pair(data_path, is_test=False): + """Reads data.""" + with open(data_path, 'r', encoding='utf-8') as f: + for line in f: + data = line.rstrip().split("\t") + if is_test == False: + if len(data) != 3: + continue + yield {'text_a': data[0], 'text_b': data[1], 'label': data[2]} + else: + if len(data) != 2: + continue + yield {'text_a': data[0], 'text_b': data[1]} \ No newline at end of file diff --git a/application/neural_search/recall/simcse/deploy/python/deploy.sh b/application/neural_search/recall/simcse/deploy/python/deploy.sh new file mode 100644 index 000000000000..fe8f071e0a47 --- /dev/null +++ b/application/neural_search/recall/simcse/deploy/python/deploy.sh @@ 
-0,0 +1 @@ +python predict.py --model_dir=../../output \ No newline at end of file diff --git a/application/neural_search/recall/simcse/deploy/python/predict.py b/application/neural_search/recall/simcse/deploy/python/predict.py new file mode 100644 index 000000000000..8a93f465b743 --- /dev/null +++ b/application/neural_search/recall/simcse/deploy/python/predict.py @@ -0,0 +1,227 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import sys + +import numpy as np +import paddle +import paddlenlp as ppnlp +from scipy.special import softmax +from paddle import inference +from paddlenlp.data import Stack, Tuple, Pad +from paddlenlp.datasets import load_dataset +from paddlenlp.utils.log import logger + + +sys.path.append('.') + + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, + help="The directory to static model.") + +parser.add_argument("--max_seq_length", default=128, type=int, + help="The maximum total input sequence length after tokenization. Sequences " + "longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=15, type=int, + help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", + help="Select which device to train model, defaults to gpu.") + +parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], + help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], + help='The tensorrt precision.') + +parser.add_argument('--cpu_threads', default=10, type=int, + help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], + help='Enable to use mkldnn to speed up when using cpu.') + +parser.add_argument("--benchmark", type=eval, default=False, + help="To log some information about environment and running.") +parser.add_argument("--save_log_path", type=str, default="./log_output/", + help="The file path to save log.") +args = parser.parse_args() +# yapf: enable + +def convert_example(example, tokenizer, max_seq_length=512, do_evalute=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. 
+ is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. + """ + + result = [] + + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + + return result + +class Predictor(object): + def __init__(self, + model_dir, + device="gpu", + max_seq_length=128, + batch_size=32, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False): + self.max_seq_length = max_seq_length + self.batch_size = batch_size + + model_file = model_dir + "/inference.get_pooled_embedding.pdmodel" + params_file = model_dir + "/inference.get_pooled_embedding.pdiparams" + if not os.path.exists(model_file): + raise ValueError("not find model file path {}".format(model_file)) + if not os.path.exists(params_file): + raise ValueError("not find params file path {}".format(params_file)) + config = paddle.inference.Config(model_file, params_file) + + if device == "gpu": + # set GPU configs accordingly + # such as intialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + "int8": inference.PrecisionType.Int8 + } + precision_mode = precision_map[precision] + + if args.use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, + min_subgraph_size=30, + precision_mode=precision_mode) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if args.enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(args.cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = paddle.inference.create_predictor(config) + self.input_handles = [ + self.predictor.get_input_handle(name) + for name in self.predictor.get_input_names() + ] + self.output_handle = self.predictor.get_output_handle( + self.predictor.get_output_names()[0]) + + if args.benchmark: + import auto_log + pid = os.getpid() + self.autolog = auto_log.AutoLogger( + model_name="ernie-1.0", + model_precision=precision, + batch_size=self.batch_size, + data_shape="dynamic", + save_path=args.save_log_path, + inference_config=config, + pids=pid, + process_name=None, + gpu_ids=0, + time_keys=[ + 'preprocess_time', 'inference_time', 'postprocess_time' + ], + warmup=0, + logger=logger) + + def predict(self, data, tokenizer): + """ + Predicts the data labels. + + Args: + data (obj:`List(str)`): The batch data whose each element is a raw text. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + + Returns: + results(obj:`dict`): All the predictions labels. 
+ """ + if args.benchmark: + self.autolog.times.start() + + examples = [] + for text in data: + input_ids, segment_ids = convert_example( + text, tokenizer) + examples.append((input_ids, segment_ids)) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # input + Pad(axis=0, pad_val=tokenizer.pad_token_id), # segment + ): fn(samples) + + if args.benchmark: + self.autolog.times.stamp() + + input_ids, segment_ids = batchify_fn(examples) + self.input_handles[0].copy_from_cpu(input_ids) + self.input_handles[1].copy_from_cpu(segment_ids) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + if args.benchmark: + self.autolog.times.stamp() + + if args.benchmark: + self.autolog.times.end(stamp=True) + + return logits + + +if __name__ == "__main__": + # Define predictor to do prediction. + predictor = Predictor(args.model_dir, args.device, args.max_seq_length, + args.batch_size, args.use_tensorrt, args.precision, + args.cpu_threads, args.enable_mkldnn) + + # ErnieTinyTokenizer is special for ernie-tiny pretained model. + output_emb_size=256 + tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained('ernie-1.0') + id2corpus={0:'国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据'} + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + res=predictor.predict(corpus_list, tokenizer) + print(res.shape) + print(res) diff --git a/application/neural_search/recall/simcse/evaluate.py b/application/neural_search/recall/simcse/evaluate.py new file mode 100644 index 000000000000..5bcdf2aef825 --- /dev/null +++ b/application/neural_search/recall/simcse/evaluate.py @@ -0,0 +1,91 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import os +import argparse + +import numpy as np + +from paddlenlp.utils.log import logger + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--similar_text_pair", type=str, default='', help="The full path of similat pair file") +parser.add_argument("--recall_result_file", type=str, default='', help="The full path of recall result file") +parser.add_argument("--recall_num", type=int, default=10, help="Most similair number of doc recalled from corpus per query") + + +args = parser.parse_args() +# yapf: enable + + +def recall(rs, N=10): + """ + Ratio of recalled Ground Truth at topN Recalled Docs + >>> rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]] + >>> recall(rs, N=1) + 0.333333 + >>> recall(rs, N=2) + >>> 0.6666667 + >>> recall(rs, N=3) + >>> 1.0 + Args: + rs: Iterator of recalled flag() + Returns: + Recall@N + """ + + recall_flags = [np.sum(r[0:N]) for r in rs] + return np.mean(recall_flags) + + +if __name__ == "__main__": + text2similar = {} + with open(args.similar_text_pair, 'r', encoding='utf-8') as f: + for line in f: + text, similar_text = line.rstrip().split("\t") + text2similar[text] = similar_text + + rs = [] + + with open(args.recall_result_file, 'r', encoding='utf-8') as f: + relevance_labels = [] + for index, line in enumerate(f): + + if index % args.recall_num == 0 and index != 0: + rs.append(relevance_labels) + relevance_labels = [] + + text, recalled_text, cosine_sim = line.rstrip().split("\t") + if text == recalled_text: + continue + if text2similar[text] == recalled_text: + relevance_labels.append(1) + else: + relevance_labels.append(0) + # print(len(rs)) + # print(rs[:50]) + + recall_N = [] + recall_num=[1,5,10,20,50] + result=open('result.tsv','a') + res=[] + for topN in recall_num: + R = round(100 * recall(rs, N=topN), 3) + recall_N.append(str(R)) + for key,val in zip(recall_num,recall_N): + print('recall@{}={}'.format(key,val)) + res.append(str(val)) + result.write('\t'.join(res)+'\n') + # print("\t".join(recall_N)) diff --git a/application/neural_search/recall/simcse/export_model.py b/application/neural_search/recall/simcse/export_model.py new file mode 100644 index 000000000000..2a9714e60d7a --- /dev/null +++ b/application/neural_search/recall/simcse/export_model.py @@ -0,0 +1,62 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
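+
+# export_model.py: converts the trained dynamic-graph SimCSE checkpoint into a
+# static-graph inference model. It rebuilds the network on top of "ernie-1.0",
+# loads the parameters from --params_path, converts the model with
+# paddle.jit.to_static using two int64 input specs (input_ids, token_type_ids),
+# and saves the result with paddle.jit.save under --output_path/inference.
+# The exported files are what deploy/python/predict.py loads for Paddle Inference.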
+ +import argparse +import os +from functools import partial + +import numpy as np +import paddle +import paddle.nn.functional as F +import paddlenlp as ppnlp +from paddlenlp.data import Stack, Tuple, Pad + +from model import SimCSE + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--params_path", type=str, required=True, default='./checkpoint/model_900/model_state.pdparams', help="The path to model parameters to be loaded.") +parser.add_argument("--output_path", type=str, default='./output', help="The path of model parameter in static graph to be saved.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + # If you want to use ernie1.0 model, plesace uncomment the following code + output_emb_size=256 + + pretrained_model = ppnlp.transformers.ErnieModel.from_pretrained("ernie-1.0") + + tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained('ernie-1.0') + model = SimCSE( + pretrained_model, output_emb_size=output_emb_size) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + + model.eval() + + # Convert to static graph with specific input description + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec( + shape=[None, None], dtype="int64"), # input_ids + paddle.static.InputSpec( + shape=[None, None], dtype="int64") # segment_ids + ]) + # Save in static graph model. + save_path = os.path.join(args.output_path, "inference") + paddle.jit.save(model, save_path) \ No newline at end of file diff --git a/application/neural_search/recall/simcse/inference.py b/application/neural_search/recall/simcse/inference.py new file mode 100644 index 000000000000..1fe834b561de --- /dev/null +++ b/application/neural_search/recall/simcse/inference.py @@ -0,0 +1,112 @@ +from functools import partial +import argparse +import os +import sys +import random +import time + +import numpy as np +import paddle +import paddle.nn.functional as F +import paddlenlp as ppnlp +from paddlenlp.data import Stack, Tuple, Pad +from paddlenlp.datasets import load_dataset, MapDataset +from paddlenlp.utils.log import logger + +from model import SimCSE +from data import create_dataloader +from tqdm import tqdm + +def convert_example(example, tokenizer, max_seq_length=512, do_evalute=False): + """ + Builds model inputs from a sequence. + + A BERT sequence has the following format: + + - single sequence: ``[CLS] X [SEP]`` + + Args: + example(obj:`list(str)`): The list of text to be converted to ids. + tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` + which contains most of the methods. Users should refer to the superclass for more information regarding methods. + max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. + Sequences longer than this will be truncated, sequences shorter will be padded. + is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. + + Returns: + input_ids(obj:`list[int]`): The list of query token ids. + token_type_ids(obj: `list[int]`): List of query sequence pair mask. 
+ """ + + result = [] + + for key, text in example.items(): + encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length) + input_ids = encoded_inputs["input_ids"] + token_type_ids = encoded_inputs["token_type_ids"] + result += [input_ids, token_type_ids] + + return result + + +if __name__ == "__main__": + device= 'gpu' + max_seq_length=64 + output_emb_size=256 + batch_size=1 + params_path='checkpoints/model_20000/model_state.pdparams' + id2corpus={0:'国有企业引入非国有资本对创新绩效的影响——基于制造业国有上市公司的经验证据'} + paddle.set_device(device) + + tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained('ernie-1.0') + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # text_segment + ): [data for data in fn(samples)] + + pretrained_model = ppnlp.transformers.ErnieModel.from_pretrained("ernie-1.0") + # pretrained_model=ErnieModel.from_pretrained("ernie-1.0") + + model = SimCSE( + pretrained_model, output_emb_size=output_emb_size) + + # Load pretrained semantic model + if params_path and os.path.isfile(params_path): + state_dict = paddle.load(params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % params_path) + else: + raise ValueError( + "Please set --params_path with correct pretrained model file") + + + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = MapDataset(corpus_list) + + corpus_data_loader = create_dataloader( + corpus_ds, + mode='predict', + batch_size=batch_size, + batchify_fn=batchify_fn, + trans_fn=trans_func) + + all_embeddings = [] + model.eval() + with paddle.no_grad(): + for batch_data in corpus_data_loader: + input_ids, token_type_ids = batch_data + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + + text_embeddings = model.get_pooled_embedding(input_ids, token_type_ids) + all_embeddings.append(text_embeddings) + + text_embedding=all_embeddings[0] + print(text_embedding.shape) + print(text_embedding.numpy()) diff --git a/application/neural_search/recall/simcse/model.py b/application/neural_search/recall/simcse/model.py new file mode 100644 index 000000000000..0823d52df4b0 --- /dev/null +++ b/application/neural_search/recall/simcse/model.py @@ -0,0 +1,154 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import abc +import sys + +import numpy as np + +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class SimCSE(nn.Layer): + def __init__(self, + pretrained_model, + dropout=None, + margin=0.0, + scale=20, + output_emb_size=None): + + super().__init__() + + self.ptm = pretrained_model + self.dropout = nn.Dropout(dropout if dropout is not None else 0.1) + + # if output_emb_size is greater than 0, then add Linear layer to reduce embedding_size, + # we recommend set output_emb_size = 256 considering the trade-off beteween + # recall performance and efficiency + self.output_emb_size = output_emb_size + if output_emb_size > 0: + weight_attr = paddle.ParamAttr( + initializer=paddle.nn.initializer.TruncatedNormal(std=0.02)) + self.emb_reduce_linear = paddle.nn.Linear( + 768, output_emb_size, weight_attr=weight_attr) + + self.margin = margin + # Used scaling cosine similarity to ease converge + self.sacle = scale + + @paddle.jit.to_static(input_spec=[paddle.static.InputSpec(shape=[None, None], dtype='int64'),paddle.static.InputSpec(shape=[None, None], dtype='int64')]) + def get_pooled_embedding(self, + input_ids, + token_type_ids=None, + position_ids=None, + attention_mask=None, + with_pooler=True): + + # Note: cls_embedding is poolerd embedding with act tanh + sequence_output, cls_embedding = self.ptm(input_ids, token_type_ids, + position_ids, attention_mask) + + if with_pooler == False: + cls_embedding = sequence_output[:, 0, :] + + if self.output_emb_size > 0: + cls_embedding = self.emb_reduce_linear(cls_embedding) + + cls_embedding = self.dropout(cls_embedding) + cls_embedding = F.normalize(cls_embedding, p=2, axis=-1) + + return cls_embedding + + def get_semantic_embedding(self, data_loader): + self.eval() + with paddle.no_grad(): + for batch_data in data_loader: + input_ids, token_type_ids = batch_data + input_ids = paddle.to_tensor(input_ids) + token_type_ids = paddle.to_tensor(token_type_ids) + + text_embeddings = self.get_pooled_embedding( + input_ids, token_type_ids=token_type_ids) + + yield text_embeddings + + def cosine_sim(self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None, + with_pooler=True): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, + query_token_type_ids, + query_position_ids, + query_attention_mask, + with_pooler=with_pooler) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, + title_token_type_ids, + title_position_ids, + title_attention_mask, + with_pooler=with_pooler) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, + axis=-1) + return cosine_sim + + def forward(self, + query_input_ids, + title_input_ids, + query_token_type_ids=None, + query_position_ids=None, + query_attention_mask=None, + title_token_type_ids=None, + title_position_ids=None, + title_attention_mask=None): + + query_cls_embedding = self.get_pooled_embedding( + query_input_ids, query_token_type_ids, query_position_ids, + query_attention_mask) + + title_cls_embedding = self.get_pooled_embedding( + title_input_ids, title_token_type_ids, title_position_ids, + title_attention_mask) + + cosine_sim = paddle.matmul( + query_cls_embedding, title_cls_embedding, transpose_y=True) + + # substract margin from all positive samples cosine_sim() + margin_diag = paddle.full( + shape=[query_cls_embedding.shape[0]], + fill_value=self.margin, + 
dtype=paddle.get_default_dtype()) + + cosine_sim = cosine_sim - paddle.diag(margin_diag) + + # scale cosine to ease training converge + cosine_sim *= self.sacle + + labels = paddle.arange(0, query_cls_embedding.shape[0], dtype='int64') + labels = paddle.reshape(labels, shape=[-1, 1]) + + loss = F.cross_entropy(input=cosine_sim, label=labels) + + return loss diff --git a/application/neural_search/recall/simcse/predict.py b/application/neural_search/recall/simcse/predict.py new file mode 100644 index 000000000000..60eaa51a05d5 --- /dev/null +++ b/application/neural_search/recall/simcse/predict.py @@ -0,0 +1,132 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from functools import partial +import argparse +import sys +import os +import random +import time + +import numpy as np +import paddle +import paddle.nn.functional as F +import paddlenlp as ppnlp +from paddlenlp.datasets import load_dataset +from paddlenlp.data import Stack, Tuple, Pad + +from data import read_text_pair, convert_example, create_dataloader +from model import SimCSE + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--text_pair_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--margin", default=0.0, type=float, help="Margin beteween pos_sample and neg_samples.") +parser.add_argument("--scale", default=20, type=int, help="Scale for pair-wise margin_rank_loss.") +parser.add_argument("--output_emb_size", default=0, type=int, help="Output_embedding_size, 0 means use hidden_size as output embedding size.") + +args = parser.parse_args() +# yapf: enable + + +def predict(model, data_loader): + """ + Predicts the data labels. + + Args: + model (obj:`SimCSE`): A model to extract text embedding or calculate similarity of text pair. + data_loaer (obj:`List(Example)`): The processed data ids of text pair: [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] + Returns: + results(obj:`List`): cosine similarity of text pairs. 
+ """ + + cosine_sims = [] + + model.eval() + + with paddle.no_grad(): + for batch_data in data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch_data + + query_input_ids = paddle.to_tensor(query_input_ids) + query_token_type_ids = paddle.to_tensor(query_token_type_ids) + title_input_ids = paddle.to_tensor(title_input_ids) + title_token_type_ids = paddle.to_tensor(title_token_type_ids) + + batch_cosine_sim = model.cosine_sim( + query_input_ids=query_input_ids, + title_input_ids=title_input_ids, + query_token_type_ids=query_token_type_ids, + title_token_type_ids=title_token_type_ids).numpy() + + cosine_sims.append(batch_cosine_sim) + + cosine_sims = np.concatenate(cosine_sims, axis=0) + + return cosine_sims + + +if __name__ == "__main__": + paddle.set_device(args.device) + + tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained('ernie-1.0') + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # tilte_segment + ): [data for data in fn(samples)] + + valid_ds = load_dataset( + read_text_pair, data_path=args.text_pair_file, lazy=False, is_test=True) + + valid_data_loader = create_dataloader( + valid_ds, + mode='predict', + batch_size=args.batch_size, + batchify_fn=batchify_fn, + trans_fn=trans_func) + + pretrained_model = ppnlp.transformers.ErnieModel.from_pretrained( + "ernie-1.0") + + model = SimCSE( + pretrained_model, + margin=args.margin, + scale=args.scale, + output_emb_size=args.output_emb_size) + + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + print("Loaded parameters from %s" % args.params_path) + else: + raise ValueError( + "Please set --params_path with correct pretrained model file") + + cosin_sim = predict(model, valid_data_loader) + + for idx, cosine in enumerate(cosin_sim): + print('{}'.format(cosine)) \ No newline at end of file diff --git a/application/neural_search/recall/simcse/recall.py b/application/neural_search/recall/simcse/recall.py new file mode 100644 index 000000000000..1b2488423c59 --- /dev/null +++ b/application/neural_search/recall/simcse/recall.py @@ -0,0 +1,141 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +# coding=UTF-8 + +from functools import partial +import argparse +import os +import sys +import random +import time + +import numpy as np +import hnswlib +import paddle +import paddle.nn.functional as F +import paddlenlp as ppnlp +from paddlenlp.data import Stack, Tuple, Pad +from paddlenlp.datasets import load_dataset, MapDataset, load_dataset +from paddlenlp.utils.log import logger + +from model import SimCSE +from data import convert_example_test, create_dataloader +from data import gen_id2corpus, gen_text_file +from ann_util import build_index + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--corpus_file", type=str, required=True, help="The full path of input file") +parser.add_argument("--similar_text_pair_file", type=str, required=True, help="The full path of similar text pair file") +parser.add_argument("--recall_result_dir", type=str, default='recall_result', help="The full path of recall result file to save") +parser.add_argument("--recall_result_file", type=str, default='recall_result_file', help="The file name of recall result") +parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") +parser.add_argument("--max_seq_length", default=64, type=int, help="The maximum total input sequence length after tokenization. " + "Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=None, type=int, help="output_embedding_size") +parser.add_argument("--recall_num", default=10, type=int, help="Recall number for each query from Ann index.") + +parser.add_argument("--hnsw_m", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_ef", default=100, type=int, help="Recall number for each query from Ann index.") +parser.add_argument("--hnsw_max_elements", default=1000000, type=int, help="Recall number for each query from Ann index.") + +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained('ernie-1.0') + + trans_func = partial( + convert_example_test, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # text_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # text_segment + ): [data for data in fn(samples)] + + pretrained_model = ppnlp.transformers.ErnieModel.from_pretrained("ernie-1.0") + + model = SimCSE( + pretrained_model, output_emb_size=args.output_emb_size) + model = paddle.DataParallel(model) + + # Load pretrained semantic model + if args.params_path and os.path.isfile(args.params_path): + state_dict = paddle.load(args.params_path) + model.set_dict(state_dict) + logger.info("Loaded parameters from %s" % args.params_path) + else: + raise ValueError( + "Please set --params_path with correct pretrained model file") + + id2corpus = gen_id2corpus(args.corpus_file) + + # conver_example function's input must be dict + corpus_list = [{idx: text} for idx, text in id2corpus.items()] + corpus_ds = 
MapDataset(corpus_list) + + corpus_data_loader = create_dataloader( + corpus_ds, + mode='predict', + batch_size=args.batch_size, + batchify_fn=batchify_fn, + trans_fn=trans_func) + + # Need better way to get inner model of DataParallel + inner_model = model._layers + + final_index = build_index(args, corpus_data_loader, inner_model) + + text_list, text2similar_text = gen_text_file(args.similar_text_pair_file) + # print(text_list[:5]) + + query_ds = MapDataset(text_list) + + query_data_loader = create_dataloader( + query_ds, + mode='predict', + batch_size=args.batch_size, + batchify_fn=batchify_fn, + trans_fn=trans_func) + + query_embedding = inner_model.get_semantic_embedding(query_data_loader) + + if not os.path.exists(args.recall_result_dir): + os.mkdir(args.recall_result_dir) + + recall_result_file = os.path.join(args.recall_result_dir, + args.recall_result_file) + with open(recall_result_file, 'w', encoding='utf-8') as f: + for batch_index, batch_query_embedding in enumerate(query_embedding): + recalled_idx, cosine_sims = final_index.knn_query( + batch_query_embedding.numpy(), args.recall_num) + + batch_size = len(cosine_sims) + + for row_index in range(batch_size): + text_index = args.batch_size * batch_index + row_index + for idx, doc_idx in enumerate(recalled_idx[row_index]): + f.write("{}\t{}\t{}\n".format(text_list[text_index][ + "text"], id2corpus[doc_idx], 1.0 - cosine_sims[ + row_index][idx])) diff --git a/application/neural_search/recall/simcse/scripts/evaluate.sh b/application/neural_search/recall/simcse/scripts/evaluate.sh new file mode 100755 index 000000000000..a95782c94d3e --- /dev/null +++ b/application/neural_search/recall/simcse/scripts/evaluate.sh @@ -0,0 +1,4 @@ + python -u evaluate.py \ + --similar_text_pair "recall/dev.csv" \ + --recall_result_file "./recall_result_dir/recall_result.txt" \ + --recall_num 50 \ No newline at end of file diff --git a/application/neural_search/recall/simcse/scripts/export_model.sh b/application/neural_search/recall/simcse/scripts/export_model.sh new file mode 100644 index 000000000000..f011b5fc900b --- /dev/null +++ b/application/neural_search/recall/simcse/scripts/export_model.sh @@ -0,0 +1 @@ +python export_model.py --params_path checkpoints/model_20000/model_state.pdparams --output_path=./output \ No newline at end of file diff --git a/application/neural_search/recall/simcse/scripts/predict.sh b/application/neural_search/recall/simcse/scripts/predict.sh new file mode 100644 index 000000000000..8b2ad20f1c2e --- /dev/null +++ b/application/neural_search/recall/simcse/scripts/predict.sh @@ -0,0 +1,20 @@ +# gpu +root_dir="checkpoints" +python -u -m paddle.distributed.launch --gpus "3" \ + predict.py \ + --device gpu \ + --params_path "${root_dir}/model_20000/model_state.pdparams" \ + --output_emb_size 256 \ + --batch_size 128 \ + --max_seq_length 64 \ + --text_pair_file "recall/test.csv" + +# cpu +root_dir="checkpoints" +python predict.py \ + --device cpu \ + --params_path "${root_dir}/model_20000/model_state.pdparams" \ + --output_emb_size 256 \ + --batch_size 128 \ + --max_seq_length 64 \ + --text_pair_file "recall/test.csv" \ No newline at end of file diff --git a/application/neural_search/recall/simcse/scripts/run_build_index.sh b/application/neural_search/recall/simcse/scripts/run_build_index.sh new file mode 100755 index 000000000000..b13fd69ed347 --- /dev/null +++ b/application/neural_search/recall/simcse/scripts/run_build_index.sh @@ -0,0 +1,30 @@ +# gpu +python -u -m paddle.distributed.launch --gpus "6" --log_dir "recall_log/" 
\ + recall.py \ + --device gpu \ + --recall_result_dir "recall_result_dir" \ + --recall_result_file "recall_result.txt" \ + --params_path "checkpoints/model_20000/model_state.pdparams" \ + --hnsw_m 100 \ + --hnsw_ef 100 \ + --batch_size 64 \ + --output_emb_size 256\ + --max_seq_length 60 \ + --recall_num 50 \ + --similar_text_pair "recall/dev.csv" \ + --corpus_file "recall/corpus.csv" + +# cpu +# python recall.py \ +# --device cpu \ +# --recall_result_dir "recall_result_dir" \ +# --recall_result_file "recall_result.txt" \ +# --params_path "checkpoints/model_20000/model_state.pdparams" \ +# --hnsw_m 100 \ +# --hnsw_ef 100 \ +# --batch_size 64 \ +# --output_emb_size 256\ +# --max_seq_length 60 \ +# --recall_num 50 \ +# --similar_text_pair "recall/dev.csv" \ +# --corpus_file "recall/corpus.csv" \ No newline at end of file diff --git a/application/neural_search/recall/simcse/scripts/train.sh b/application/neural_search/recall/simcse/scripts/train.sh new file mode 100644 index 000000000000..ac9310375fe1 --- /dev/null +++ b/application/neural_search/recall/simcse/scripts/train.sh @@ -0,0 +1,55 @@ +# simcse gpu +python -u -m paddle.distributed.launch --gpus '0,1,2,3' \ + train.py \ + --device gpu \ + --save_dir ./checkpoints/ \ + --batch_size 64 \ + --learning_rate 5E-5 \ + --epochs 3 \ + --save_steps 2000 \ + --eval_steps 100 \ + --max_seq_length 64 \ + --infer_with_fc_pooler \ + --dropout 0.2 \ + --output_emb_size 256 \ + --train_set_file "./recall/train_unsupervised.csv" \ + --test_set_file "./recall/dev.csv" + --model_name_or_path "ernie-1.0" + +# simcse cpu +# python train.py \ +# --device cpu \ +# --save_dir ./checkpoints/ \ +# --batch_size 64 \ +# --learning_rate 5E-5 \ +# --epochs 3 \ +# --save_steps 2000 \ +# --eval_steps 100 \ +# --max_seq_length 64 \ +# --infer_with_fc_pooler \ +# --dropout 0.2 \ +# --output_emb_size 256 \ +# --train_set_file "./recall/train_unsupervised.csv" \ +# --test_set_file "./recall/dev.csv" +# --model_name_or_path "ernie-1.0" + +# post training + simcse +# python -u -m paddle.distributed.launch --gpus '0,1,2,3' \ +# train.py \ +# --device gpu \ +# --save_dir ./checkpoints/ \ +# --batch_size 64 \ +# --learning_rate 5E-5 \ +# --epochs 3 \ +# --save_steps 2000 \ +# --eval_steps 100 \ +# --max_seq_length 64 \ +# --infer_with_fc_pooler \ +# --dropout 0.2 \ +# --output_emb_size 256 \ +# --train_set_file "./recall/train_unsupervised.csv" \ +# --test_set_file "./recall/dev.csv" +# --model_name_or_path "post_ernie" + + + diff --git a/application/neural_search/recall/simcse/train.py b/application/neural_search/recall/simcse/train.py new file mode 100644 index 000000000000..a74e138ced98 --- /dev/null +++ b/application/neural_search/recall/simcse/train.py @@ -0,0 +1,206 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
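+
+# train.py: entry point for unsupervised SimCSE training.
+# Every line of --train_set_file is read as both text_a and text_b (read_simcse_text),
+# so the two "views" of a sentence differ only through dropout inside the encoder.
+# The ERNIE encoder is loaded from --model_name_or_path with the configured dropout,
+# wrapped by the SimCSE model and optimized with AdamW under a linear warmup/decay
+# schedule; the training loss is logged to VisualDL and a checkpoint is written
+# every --save_steps steps.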
+ +from functools import partial +import argparse +import os +import sys +import random +import time + +from scipy import stats +import numpy as np +import paddle +import paddle.nn.functional as F + +import paddlenlp as ppnlp +from paddlenlp.data import Stack, Tuple, Pad +from paddlenlp.datasets import load_dataset +from paddlenlp.transformers import LinearDecayWithWarmup +from visualdl import LogWriter +import time + +from model import SimCSE +from data import read_simcse_text, read_text_pair, convert_example, create_dataloader + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--save_dir", default='./checkpoint', type=str, help="The output directory where the model checkpoints will be written.") +parser.add_argument("--max_seq_length", default=128, type=int, help="The maximum total input sequence length after tokenization." + "Sequences longer than this will be truncated, sequences shorter will be padded.") +parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") +parser.add_argument("--output_emb_size", default=0, type=int, help="Output_embedding_size, 0 means use hidden_size as output embedding size.") +parser.add_argument("--learning_rate", default=1e-5, type=float, help="The initial learning rate for Adam.") +parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") +parser.add_argument("--epochs", default=1, type=int, help="Total number of training epochs to perform.") +parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proption over the training process.") +parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") +parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--save_steps', type=int, default=10000, help="Step interval for saving checkpoint.") +parser.add_argument('--eval_steps', type=int, default=10000, help="Step interval for evaluation.") +parser.add_argument("--train_set_file", type=str, required=True, help="The full path of train_set_file.") +parser.add_argument("--test_set_file", type=str, required=True, help="The full path of test_set_file.") +parser.add_argument("--margin", default=0.0, type=float, help="Margin beteween pos_sample and neg_samples.") +parser.add_argument("--scale", default=20, type=int, help="Scale for pair-wise margin_rank_loss.") +parser.add_argument("--dropout", default=0.1, type=float, help="Dropout for pretrained model encoder.") +parser.add_argument("--infer_with_fc_pooler", action='store_true', help="Whether use fc layer after cls embedding or not for when infer.") +parser.add_argument("--model_name_or_path",default='ernie-1.0',type=str,help='pretrained model') + +args = parser.parse_args() + +def set_seed(seed): + """sets random seed""" + random.seed(seed) + np.random.seed(seed) + paddle.seed(seed) + +def do_evaluate(model, tokenizer, data_loader, with_pooler=False): + model.eval() + + total_num = 0 + spearman_corr = 0.0 + sims = [] + labels = [] + + for batch in data_loader: + query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids, label = batch + total_num += len(label) + + query_cls_embedding = model.get_pooled_embedding( + query_input_ids, query_token_type_ids, with_pooler=with_pooler) + + title_cls_embedding = 
model.get_pooled_embedding(title_input_ids, title_token_type_ids, with_pooler=with_pooler) + + cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding, axis=-1) + + sims.append(cosine_sim.numpy()) + labels.append(label.numpy()) + + sims = np.concatenate(sims, axis=0) + labels = np.concatenate(labels, axis=0) + + spearman_corr = stats.spearmanr(labels, sims).correlation + model.train() + return spearman_corr, total_num + +def do_train(): + paddle.set_device(args.device) + rank = paddle.distributed.get_rank() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + + set_seed(args.seed) + writer=LogWriter(logdir="./log/scalar_test/train") + + train_ds = load_dataset( + read_simcse_text, data_path=args.train_set_file, lazy=False) + + + pretrained_model = ppnlp.transformers.ErnieModel.from_pretrained( + args.model_name_or_path, + hidden_dropout_prob=args.dropout, + attention_probs_dropout_prob=args.dropout) + print("loading model from {}".format(args.model_name_or_path)) + tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained('ernie-1.0') + + trans_func = partial( + convert_example, + tokenizer=tokenizer, + max_seq_length=args.max_seq_length) + + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id), # query_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # query_segment + Pad(axis=0, pad_val=tokenizer.pad_token_id), # title_input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # tilte_segment + ): [data for data in fn(samples)] + + + train_data_loader = create_dataloader( + train_ds, + mode='train', + batch_size=args.batch_size, + batchify_fn=batchify_fn, + trans_fn=trans_func) + + + model = SimCSE( + pretrained_model, + margin=args.margin, + scale=args.scale, + output_emb_size=args.output_emb_size) + + if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): + state_dict = paddle.load(args.init_from_ckpt) + model.set_dict(state_dict) + print("warmup from:{}".format(args.init_from_ckpt)) + + model = paddle.DataParallel(model) + + num_training_steps = len(train_data_loader) * args.epochs + + lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, + args.warmup_proportion) + + # Generate parameter names needed to perform weight decay. + # All bias and LayerNorm parameters are excluded. 
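+    # decay_params collects the names of all parameters except biases and LayerNorm
+    # weights; AdamW's apply_decay_param_fun then restricts weight decay to that set.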
+    decay_params = [
+        p.name for n, p in model.named_parameters()
+        if not any(nd in n for nd in ["bias", "norm"])
+    ]
+    optimizer = paddle.optimizer.AdamW(
+        learning_rate=lr_scheduler,
+        parameters=model.parameters(),
+        weight_decay=args.weight_decay,
+        apply_decay_param_fun=lambda x: x in decay_params)
+
+    time_start = time.time()
+    global_step = 0
+    tic_train = time.time()
+    for epoch in range(1, args.epochs + 1):
+        for step, batch in enumerate(train_data_loader, start=1):
+            query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids = batch
+
+            loss = model(
+                query_input_ids=query_input_ids,
+                title_input_ids=title_input_ids,
+                query_token_type_ids=query_token_type_ids,
+                title_token_type_ids=title_token_type_ids)
+
+            global_step += 1
+            if global_step % 10 == 0 and rank == 0:
+                print("global step %d, epoch: %d, batch: %d, loss: %.5f, speed: %.2f step/s"
+                      % (global_step, epoch, step, loss,
+                         10 / (time.time() - tic_train)))
+                writer.add_scalar(tag="loss", step=global_step, value=loss)
+                tic_train = time.time()
+
+            loss.backward()
+            optimizer.step()
+            lr_scheduler.step()
+            optimizer.clear_grad()
+            if global_step % args.save_steps == 0 and rank == 0:
+                save_dir = os.path.join(args.save_dir, "model_%d" % global_step)
+                if not os.path.exists(save_dir):
+                    os.makedirs(save_dir)
+                save_param_path = os.path.join(save_dir, 'model_state.pdparams')
+                paddle.save(model.state_dict(), save_param_path)
+                tokenizer.save_pretrained(save_dir)
+    time_end = time.time()
+    print('Total training time: {:.2f}s'.format(time_end - time_start))
+
+
+if __name__ == "__main__":
+    do_train()
diff --git a/application/neural_search/requirements.txt b/application/neural_search/requirements.txt
new file mode 100644
index 000000000000..c6635cf3a75a
--- /dev/null
+++ b/application/neural_search/requirements.txt
@@ -0,0 +1,8 @@
+pymilvus
+pandas==0.25.1
+paddlenlp==2.1.1
+paddlepaddle-gpu==2.1.3
+hnswlib>=0.5.2
+numpy>=1.17.2
+visualdl>=2.2.2
+pybind11
\ No newline at end of file
diff --git a/application/rocket-qa/README.md b/application/rocket-qa/README.md
new file mode 100644
index 000000000000..52e29b462592
--- /dev/null
+++ b/application/rocket-qa/README.md
@@ -0,0 +1,3 @@
+# RocketQA
+
+[https://github.com/PaddlePaddle/RocketQA](https://github.com/PaddlePaddle/RocketQA)
\ No newline at end of file
diff --git a/examples/language_model/bert/static/create_pretraining_data.py b/examples/language_model/bert/static/create_pretraining_data.py
deleted file mode 120000
index 5870cbbe32a0..000000000000
--- a/examples/language_model/bert/static/create_pretraining_data.py
+++ /dev/null
@@ -1 +0,0 @@
-../create_pretraining_data.py
\ No newline at end of file
diff --git a/examples/language_model/bert/static/create_pretraining_data.py b/examples/language_model/bert/static/create_pretraining_data.py
new file mode 100644
index 000000000000..ccf28f855cd5
--- /dev/null
+++ b/examples/language_model/bert/static/create_pretraining_data.py
@@ -0,0 +1,499 @@
+# coding=utf-8
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
+# Copyright 2018 The Google AI Language Team Authors and The HugginFace Inc. team.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Create masked LM/next sentence masked_lm examples for BERT.""" +from __future__ import absolute_import, division, print_function, unicode_literals + +import argparse +import logging +import os +import random +from io import open +import h5py +import numpy as np +from tqdm import tqdm + +from paddlenlp.transformers import BertTokenizer +from paddlenlp.transformers.tokenizer_utils import convert_to_unicode + +import random +import collections + + +class TrainingInstance(object): + """A single training instance (sentence pair).""" + + def __init__(self, tokens, segment_ids, masked_lm_positions, + masked_lm_labels, is_random_next): + self.tokens = tokens + self.segment_ids = segment_ids + self.is_random_next = is_random_next + self.masked_lm_positions = masked_lm_positions + self.masked_lm_labels = masked_lm_labels + + +def write_instance_to_example_file(instances, tokenizer, max_seq_length, + max_predictions_per_seq, output_file): + """Create example files from `TrainingInstance`s.""" + + total_written = 0 + features = collections.OrderedDict() + + num_instances = len(instances) + features["input_ids"] = np.zeros( + [num_instances, max_seq_length], dtype="int32") + features["input_mask"] = np.zeros( + [num_instances, max_seq_length], dtype="int32") + features["segment_ids"] = np.zeros( + [num_instances, max_seq_length], dtype="int32") + features["masked_lm_positions"] = np.zeros( + [num_instances, max_predictions_per_seq], dtype="int32") + features["masked_lm_ids"] = np.zeros( + [num_instances, max_predictions_per_seq], dtype="int32") + features["next_sentence_labels"] = np.zeros(num_instances, dtype="int32") + + for inst_index, instance in enumerate(tqdm(instances)): + input_ids = tokenizer.convert_tokens_to_ids(instance.tokens) + input_mask = [1] * len(input_ids) + segment_ids = list(instance.segment_ids) + assert len(input_ids) <= max_seq_length + + while len(input_ids) < max_seq_length: + input_ids.append(0) + input_mask.append(0) + segment_ids.append(0) + + assert len(input_ids) == max_seq_length + assert len(input_mask) == max_seq_length + assert len(segment_ids) == max_seq_length + + masked_lm_positions = list(instance.masked_lm_positions) + masked_lm_ids = tokenizer.convert_tokens_to_ids( + instance.masked_lm_labels) + masked_lm_weights = [1.0] * len(masked_lm_ids) + + while len(masked_lm_positions) < max_predictions_per_seq: + masked_lm_positions.append(0) + masked_lm_ids.append(0) + masked_lm_weights.append(0.0) + + next_sentence_label = 1 if instance.is_random_next else 0 + + features["input_ids"][inst_index] = input_ids + features["input_mask"][inst_index] = input_mask + features["segment_ids"][inst_index] = segment_ids + features["masked_lm_positions"][inst_index] = masked_lm_positions + features["masked_lm_ids"][inst_index] = masked_lm_ids + features["next_sentence_labels"][inst_index] = next_sentence_label + + total_written += 1 + + print("saving data") + f = h5py.File(output_file, 'w') + f.create_dataset( + "input_ids", data=features["input_ids"], dtype='i4', compression='gzip') + f.create_dataset( + "input_mask", + data=features["input_mask"], + dtype='i1', + 
compression='gzip') + f.create_dataset( + "segment_ids", + data=features["segment_ids"], + dtype='i1', + compression='gzip') + f.create_dataset( + "masked_lm_positions", + data=features["masked_lm_positions"], + dtype='i4', + compression='gzip') + f.create_dataset( + "masked_lm_ids", + data=features["masked_lm_ids"], + dtype='i4', + compression='gzip') + f.create_dataset( + "next_sentence_labels", + data=features["next_sentence_labels"], + dtype='i1', + compression='gzip') + f.flush() + f.close() + + +def create_training_instances(input_files, tokenizer, max_seq_length, + dupe_factor, short_seq_prob, masked_lm_prob, + max_predictions_per_seq, rng): + """Create `TrainingInstance`s from raw text.""" + all_documents = [[]] + + # Input file format: + # (1) One sentence per line. These should ideally be actual sentences, not + # entire paragraphs or arbitrary spans of text. (Because we use the + # sentence boundaries for the "next sentence prediction" task). + # (2) Blank lines between documents. Document boundaries are needed so + # that the "next sentence prediction" task doesn't span between documents. + for input_file in input_files: + print("creating instance from {}".format(input_file)) + with open(input_file, "r", encoding="UTF-8") as reader: + while True: + line = convert_to_unicode(reader.readline()) + if not line: + break + line = line.strip() + + # Empty lines are used as document delimiters + if not line: + all_documents.append([]) + tokens = tokenizer.tokenize(line) + if tokens: + all_documents[-1].append(tokens) + + # Remove empty documents + all_documents = [x for x in all_documents if x] + rng.shuffle(all_documents) + + # vocab_words = list(tokenizer.vocab.keys()) + vocab_words = list(tokenizer.vocab.token_to_idx.keys()) + instances = [] + for _ in range(dupe_factor): + for document_index in range(len(all_documents)): + instances.extend( + create_instances_from_document( + all_documents, document_index, max_seq_length, + short_seq_prob, masked_lm_prob, max_predictions_per_seq, + vocab_words, rng)) + + rng.shuffle(instances) + return instances + + +def create_instances_from_document( + all_documents, document_index, max_seq_length, short_seq_prob, + masked_lm_prob, max_predictions_per_seq, vocab_words, rng): + """Creates `TrainingInstance`s for a single document.""" + document = all_documents[document_index] + + # Account for [CLS], [SEP], [SEP] + max_num_tokens = max_seq_length - 3 + + # We *usually* want to fill up the entire sequence since we are padding + # to `max_seq_length` anyways, so short sequences are generally wasted + # computation. However, we *sometimes* + # (i.e., short_seq_prob == 0.1 == 10% of the time) want to use shorter + # sequences to minimize the mismatch between pre-training and fine-tuning. + # The `target_seq_length` is just a rough target however, whereas + # `max_seq_length` is a hard limit. + target_seq_length = max_num_tokens + if rng.random() < short_seq_prob: + target_seq_length = rng.randint(2, max_num_tokens) + + # We DON'T just concatenate all of the tokens from a document into a long + # sequence and choose an arbitrary split point because this would make the + # next sentence prediction task too easy. Instead, we split the input into + # segments "A" and "B" based on the actual "sentences" provided by the user + # input. 
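+    # Greedily pack whole sentences into a chunk until `target_seq_length` is reached, then split the
+    # chunk into segment A and segment B (B is either the true continuation or text from a random document).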
+ instances = [] + current_chunk = [] + current_length = 0 + i = 0 + while i < len(document): + segment = document[i] + current_chunk.append(segment) + current_length += len(segment) + if i == len(document) - 1 or current_length >= target_seq_length: + if current_chunk: + # `a_end` is how many segments from `current_chunk` go into the `A` + # (first) sentence. + a_end = 1 + if len(current_chunk) >= 2: + a_end = rng.randint(1, len(current_chunk) - 1) + + tokens_a = [] + for j in range(a_end): + tokens_a.extend(current_chunk[j]) + + tokens_b = [] + # Random next + is_random_next = False + if len(current_chunk) == 1 or rng.random() < 0.5: + is_random_next = True + target_b_length = target_seq_length - len(tokens_a) + + # This should rarely go for more than one iteration for large + # corpora. However, just to be careful, we try to make sure that + # the random document is not the same as the document + # we're processing. + for _ in range(10): + random_document_index = rng.randint( + 0, len(all_documents) - 1) + if random_document_index != document_index: + break + + #If picked random document is the same as the current document + if random_document_index == document_index: + is_random_next = False + + random_document = all_documents[random_document_index] + random_start = rng.randint(0, len(random_document) - 1) + for j in range(random_start, len(random_document)): + tokens_b.extend(random_document[j]) + if len(tokens_b) >= target_b_length: + break + # We didn't actually use these segments so we "put them back" so + # they don't go to waste. + num_unused_segments = len(current_chunk) - a_end + i -= num_unused_segments + # Actual next + else: + is_random_next = False + for j in range(a_end, len(current_chunk)): + tokens_b.extend(current_chunk[j]) + truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng) + + assert len(tokens_a) >= 1 + assert len(tokens_b) >= 1 + + tokens = [] + segment_ids = [] + tokens.append("[CLS]") + segment_ids.append(0) + for token in tokens_a: + tokens.append(token) + segment_ids.append(0) + + tokens.append("[SEP]") + segment_ids.append(0) + + for token in tokens_b: + tokens.append(token) + segment_ids.append(1) + tokens.append("[SEP]") + segment_ids.append(1) + + (tokens, masked_lm_positions, + masked_lm_labels) = create_masked_lm_predictions( + tokens, masked_lm_prob, max_predictions_per_seq, + vocab_words, rng) + instance = TrainingInstance( + tokens=tokens, + segment_ids=segment_ids, + is_random_next=is_random_next, + masked_lm_positions=masked_lm_positions, + masked_lm_labels=masked_lm_labels) + instances.append(instance) + current_chunk = [] + current_length = 0 + i += 1 + + return instances + + +MaskedLmInstance = collections.namedtuple("MaskedLmInstance", + ["index", "label"]) + + +def create_masked_lm_predictions(tokens, masked_lm_prob, + max_predictions_per_seq, vocab_words, rng): + """Creates the predictions for the masked LM objective.""" + + cand_indexes = [] + for (i, token) in enumerate(tokens): + if token == "[CLS]" or token == "[SEP]": + continue + cand_indexes.append(i) + + rng.shuffle(cand_indexes) + + output_tokens = list(tokens) + + num_to_predict = min(max_predictions_per_seq, + max(1, int(round(len(tokens) * masked_lm_prob)))) + + masked_lms = [] + covered_indexes = set() + for index in cand_indexes: + if len(masked_lms) >= num_to_predict: + break + if index in covered_indexes: + continue + covered_indexes.add(index) + + masked_token = None + # 80% of the time, replace with [MASK] + if rng.random() < 0.8: + masked_token = "[MASK]" + else: + # 
10% of the time, keep original + if rng.random() < 0.5: + masked_token = tokens[index] + # 10% of the time, replace with random word + else: + masked_token = vocab_words[rng.randint(0, len(vocab_words) - 1)] + + output_tokens[index] = masked_token + + masked_lms.append(MaskedLmInstance(index=index, label=tokens[index])) + + masked_lms = sorted(masked_lms, key=lambda x: x.index) + + masked_lm_positions = [] + masked_lm_labels = [] + for p in masked_lms: + masked_lm_positions.append(p.index) + masked_lm_labels.append(p.label) + + return (output_tokens, masked_lm_positions, masked_lm_labels) + + +def truncate_seq_pair(tokens_a, tokens_b, max_num_tokens, rng): + """Truncates a pair of sequences to a maximum sequence length.""" + while True: + total_length = len(tokens_a) + len(tokens_b) + if total_length <= max_num_tokens: + break + + trunc_tokens = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b + assert len(trunc_tokens) >= 1 + + # We want to sometimes truncate from the front and sometimes from the + # back to add more randomness and avoid biases. + if rng.random() < 0.5: + del trunc_tokens[0] + else: + trunc_tokens.pop() + + +def main(): + + parser = argparse.ArgumentParser() + + parser.add_argument( + "--input_file", + default=None, + type=str, + required=True, + help="The input train corpus. can be directory with .txt files or a path to a single file" + ) + parser.add_argument( + "--output_file", + default=None, + type=str, + required=True, + help="The output file where created hdf5 formatted data will be written." + ) + parser.add_argument( + "--vocab_file", + default=None, + type=str, + required=False, + help="The vocabulary the BERT model will train on. " + "Use bert_model argument would ignore this. " + "The bert_model argument is recommended.") + parser.add_argument( + "--do_lower_case", + action='store_true', + default=True, + help="Whether to lower case the input text. True for uncased models, False for cased models. " + "Use bert_model argument would ignore this. The bert_model argument is recommended." + ) + parser.add_argument( + "--bert_model", + default="bert-base-uncased", + type=str, + required=False, + help="Bert pre-trained model selected in the list: bert-base-uncased, " + "bert-large-uncased, bert-base-cased, bert-base-multilingual, bert-base-chinese." + "If provided, use the pre-trained model used tokenizer to create data " + "and ignore vocab_file and do_lower_case.") + + ## Other parameters + #int + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after WordPiece tokenization. \n" + "Sequences longer than this will be truncated, and sequences shorter \n" + "than this will be padded.") + parser.add_argument( + "--dupe_factor", + default=10, + type=int, + help="Number of times to duplicate the input data (with different masks)." 
+ ) + parser.add_argument( + "--max_predictions_per_seq", + default=20, + type=int, + help="Maximum number of masked LM predictions per sequence.") + + # floats + parser.add_argument( + "--masked_lm_prob", + default=0.15, + type=float, + help="Masked LM probability.") + parser.add_argument( + "--short_seq_prob", + default=0.1, + type=float, + help="Probability to create a sequence shorter than maximum sequence length" + ) + + parser.add_argument( + '--random_seed', + type=int, + default=12345, + help="random seed for initialization") + + args = parser.parse_args() + print(args) + + if args.bert_model: + tokenizer = BertTokenizer.from_pretrained(args.bert_model) + else: + assert args.vocab_file, ( + "vocab_file must be set If bert_model is not provided.") + tokenizer = BertTokenizer( + args.vocab_file, do_lower_case=args.do_lower_case) + + input_files = [] + if os.path.isfile(args.input_file): + input_files.append(args.input_file) + elif os.path.isdir(args.input_file): + input_files = [ + os.path.join(args.input_file, f) + for f in os.listdir(args.input_file) + if (os.path.isfile(os.path.join(args.input_file, f)) and f.endswith( + '.txt')) + ] + else: + raise ValueError("{} is not a valid path".format(args.input_file)) + + rng = random.Random(args.random_seed) + instances = create_training_instances( + input_files, tokenizer, args.max_seq_length, args.dupe_factor, + args.short_seq_prob, args.masked_lm_prob, args.max_predictions_per_seq, + rng) + + output_file = args.output_file + + write_instance_to_example_file(instances, tokenizer, args.max_seq_length, + args.max_predictions_per_seq, output_file) + + +if __name__ == "__main__": + main() diff --git a/examples/language_model/bert/static/predict_glue.py b/examples/language_model/bert/static/predict_glue.py deleted file mode 120000 index 8360812411b9..000000000000 --- a/examples/language_model/bert/static/predict_glue.py +++ /dev/null @@ -1 +0,0 @@ -../predict_glue.py \ No newline at end of file diff --git a/examples/language_model/bert/static/predict_glue.py b/examples/language_model/bert/static/predict_glue.py new file mode 100644 index 000000000000..bea832aa018b --- /dev/null +++ b/examples/language_model/bert/static/predict_glue.py @@ -0,0 +1,158 @@ +# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
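+"""Run GLUE inference with an exported static-graph model using Paddle Inference."""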
+ +import argparse +import os +from functools import partial + +import paddle +from paddle import inference +from paddlenlp.datasets import load_dataset +from paddlenlp.data import Stack, Tuple, Pad + +from run_glue import convert_example, METRIC_CLASSES, MODEL_CLASSES + + +def parse_args(): + parser = argparse.ArgumentParser() + + # Required parameters + parser.add_argument( + "--task_name", + default=None, + type=str, + required=True, + help="The name of the task to perform predict, selected in the list: " + + ", ".join(METRIC_CLASSES.keys()), ) + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + + ", ".join(MODEL_CLASSES.keys()), ) + parser.add_argument( + "--model_path", + default=None, + type=str, + required=True, + help="The path prefix of inference model to be used.", ) + parser.add_argument( + "--device", + default="gpu", + choices=["gpu", "cpu", "xpu"], + help="Device selected for inference.", ) + parser.add_argument( + "--batch_size", + default=32, + type=int, + help="Batch size for predict.", ) + parser.add_argument( + "--max_seq_length", + default=128, + type=int, + help="The maximum total input sequence length after tokenization. Sequences longer " + "than this will be truncated, sequences shorter will be padded.", ) + args = parser.parse_args() + return args + + +class Predictor(object): + def __init__(self, predictor, input_handles, output_handles): + self.predictor = predictor + self.input_handles = input_handles + self.output_handles = output_handles + + @classmethod + def create_predictor(cls, args): + config = paddle.inference.Config(args.model_path + ".pdmodel", + args.model_path + ".pdiparams") + if args.device == "gpu": + # set GPU configs accordingly + config.enable_use_gpu(100, 0) + elif args.device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + elif args.device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + config.switch_use_feed_fetch_ops(False) + predictor = paddle.inference.create_predictor(config) + input_handles = [ + predictor.get_input_handle(name) + for name in predictor.get_input_names() + ] + output_handles = [ + predictor.get_output_handle(name) + for name in predictor.get_output_names() + ] + return cls(predictor, input_handles, output_handles) + + def predict_batch(self, data): + for input_field, input_handle in zip(data, self.input_handles): + input_handle.copy_from_cpu(input_field.numpy() if isinstance( + input_field, paddle.Tensor) else input_field) + self.predictor.run() + output = [ + output_handle.copy_to_cpu() for output_handle in self.output_handles + ] + return output + + def predict(self, dataset, collate_fn, batch_size=1): + batch_sampler = paddle.io.BatchSampler( + dataset, batch_size=batch_size, shuffle=False) + data_loader = paddle.io.DataLoader( + dataset=dataset, + batch_sampler=batch_sampler, + collate_fn=collate_fn, + num_workers=0, + return_list=True) + outputs = [] + for data in data_loader: + output = self.predict_batch(data) + outputs.append(output) + return outputs + + +def main(): + args = parse_args() + + predictor = Predictor.create_predictor(args) + + args.task_name = args.task_name.lower() + args.model_type = args.model_type.lower() + model_class, tokenizer_class = MODEL_CLASSES[args.model_type] + + test_ds = load_dataset('glue', args.task_name, splits="test") + tokenizer = tokenizer_class.from_pretrained( + os.path.dirname(args.model_path)) + + 
trans_func = partial( + convert_example, + tokenizer=tokenizer, + label_list=test_ds.label_list, + max_seq_length=args.max_seq_length, + is_test=True) + test_ds = test_ds.map(trans_func) + batchify_fn = lambda samples, fn=Tuple( + Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input + Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # segment + ): fn(samples) + predictor.predict( + test_ds, batch_size=args.batch_size, collate_fn=batchify_fn) + + +if __name__ == "__main__": + main() diff --git a/examples/language_model/gpt-3/dygraph/lr.py b/examples/language_model/gpt-3/dygraph/lr.py deleted file mode 120000 index 09070bccaa33..000000000000 --- a/examples/language_model/gpt-3/dygraph/lr.py +++ /dev/null @@ -1 +0,0 @@ -../../gpt/lr.py \ No newline at end of file diff --git a/examples/language_model/gpt-3/dygraph/lr.py b/examples/language_model/gpt-3/dygraph/lr.py new file mode 100644 index 000000000000..0736246910f0 --- /dev/null +++ b/examples/language_model/gpt-3/dygraph/lr.py @@ -0,0 +1,49 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import math +import numpy +import warnings +from paddle import Tensor +from paddle.optimizer.lr import LRScheduler + + +class CosineAnnealingWithWarmupDecay(LRScheduler): + def __init__(self, + max_lr, + min_lr, + warmup_step, + decay_step, + last_epoch=0, + verbose=False): + + self.decay_step = decay_step + self.warmup_step = warmup_step + self.max_lr = max_lr + self.min_lr = min_lr + super(CosineAnnealingWithWarmupDecay, self).__init__(max_lr, last_epoch, + verbose) + + def get_lr(self): + if self.warmup_step > 0 and self.last_epoch <= self.warmup_step: + return float(self.max_lr) * (self.last_epoch) / self.warmup_step + + if self.last_epoch > self.decay_step: + return self.min_lr + + num_step_ = self.last_epoch - self.warmup_step + decay_step_ = self.decay_step - self.warmup_step + decay_ratio = float(num_step_) / float(decay_step_) + coeff = 0.5 * (math.cos(math.pi * decay_ratio) + 1.0) + return self.min_lr + coeff * (self.max_lr - self.min_lr) diff --git a/examples/language_model/gpt-3/static/args.py b/examples/language_model/gpt-3/static/args.py deleted file mode 120000 index 146f753c8b2e..000000000000 --- a/examples/language_model/gpt-3/static/args.py +++ /dev/null @@ -1 +0,0 @@ -../../gpt/args.py \ No newline at end of file diff --git a/examples/language_model/gpt-3/static/args.py b/examples/language_model/gpt-3/static/args.py new file mode 100644 index 000000000000..9a452d33f186 --- /dev/null +++ b/examples/language_model/gpt-3/static/args.py @@ -0,0 +1,273 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import paddle +from paddlenlp.utils.log import logger + + +def str2bool(v): + if v.lower() in ('yes', 'true', 't', 'y', '1'): + return True + elif v.lower() in ('no', 'false', 'f', 'n', '0'): + return False + else: + raise argparse.ArgumentTypeError('Unsupported value encountered.') + + +def parse_args(MODEL_CLASSES): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model_type", + default=None, + type=str, + required=True, + help="Model type selected in the list: " + + ", ".join(MODEL_CLASSES.keys()), ) + parser.add_argument( + "--model_name_or_path", + default=None, + type=str, + required=True, + help="Path to pre-trained model or shortcut name selected in the list: " + + ", ".join( + sum([ + list(classes[-1].pretrained_init_configuration.keys()) + for classes in MODEL_CLASSES.values() + ], [])), ) + + # Train I/O config + parser.add_argument( + "--input_dir", + default=None, + type=str, + required=True, + help="The input directory where the data will be read from.", ) + parser.add_argument( + "--output_dir", + default=None, + type=str, + required=True, + help="The output directory where the training logs and checkpoints will be written." + ) + parser.add_argument( + "--split", + type=str, + default='949,50,1', + help="Train/valid/test data split.") + + parser.add_argument( + "--max_seq_len", type=int, default=1024, help="Max sequence length.") + parser.add_argument( + "--micro_batch_size", + default=8, + type=int, + help="Batch size per device for one step training.", ) + parser.add_argument( + "--global_batch_size", + default=None, + type=int, + help="Global batch size for all training process. None for not check the size is valid. If we only use data parallelism, it should be device_num * micro_batch_size." + ) + + # Default training config + parser.add_argument( + "--weight_decay", + default=0.0, + type=float, + help="Weight decay if we apply some.") + parser.add_argument( + "--grad_clip", + default=0.0, + type=float, + help="Grad clip for the parameter.") + parser.add_argument( + "--max_lr", + default=1e-5, + type=float, + help="The initial max learning rate for Adam.") + parser.add_argument( + "--min_lr", + default=5e-5, + type=float, + help="The initial min learning rate for Adam.") + parser.add_argument( + "--warmup_rate", + default=0.01, + type=float, + help="Linear warmup over warmup_steps for learing rate.") + + # Adam optimizer config + parser.add_argument( + "--adam_beta1", + default=0.9, + type=float, + help="The beta1 for Adam optimizer. The exponential decay rate for the 1st moment estimates." + ) + parser.add_argument( + "--adam_beta2", + default=0.999, + type=float, + help="The bate2 for Adam optimizer. The exponential decay rate for the 2nd moment estimates." 
+ ) + parser.add_argument( + "--adam_epsilon", + default=1e-8, + type=float, + help="Epsilon for Adam optimizer.") + + # Training steps config + parser.add_argument( + "--max_steps", + default=500000, + type=int, + help="set total number of training steps to perform.") + parser.add_argument( + "--save_steps", + type=int, + default=500, + help="Save checkpoint every X updates steps.") + parser.add_argument( + "--decay_steps", + default=360000, + type=int, + help="The steps use to control the learing rate. If the step > decay_steps, will use the min_lr." + ) + parser.add_argument( + "--logging_freq", + type=int, + default=1, + help="Log every X updates steps.") + parser.add_argument( + "--eval_freq", + type=int, + default=500, + help="Evaluate for every X updates steps.") + parser.add_argument( + "--eval_iters", + type=int, + default=10, + help="Evaluate the model use X steps data.") + + # Config for 4D Parallelism + parser.add_argument( + "--use_sharding", + type=str2bool, + nargs='?', + const=False, + help="Use sharding Parallelism to training.") + parser.add_argument( + "--sharding_degree", + type=int, + default=1, + help="Sharding degree. Share the parameters to many cards.") + parser.add_argument( + "--dp_degree", type=int, default=1, help="Data Parallelism degree.") + parser.add_argument( + "--mp_degree", + type=int, + default=1, + help="Model Parallelism degree. Spliting the linear layers to many cards." + ) + parser.add_argument( + "--pp_degree", + type=int, + default=1, + help="Pipeline Parallelism degree. Spliting the the model layers to different parts." + ) + parser.add_argument( + "--use_recompute", + type=str2bool, + nargs='?', + const=False, + help="Using the recompute to save the memory.") + + # AMP config + parser.add_argument( + "--use_amp", + type=str2bool, + nargs='?', + const=False, + help="Enable mixed precision training.") + parser.add_argument( + "--enable_addto", + type=str2bool, + nargs='?', + const=True, + help="Whether to enable the addto strategy for gradient accumulation or not. This is only used for AMP training." + ) + parser.add_argument( + "--scale_loss", + type=float, + default=32768, + help="The value of scale_loss for fp16. This is only used for AMP training." + ) + parser.add_argument( + "--hidden_dropout_prob", + type=float, + default=0.1, + help="The hidden dropout prob.") + parser.add_argument( + "--attention_probs_dropout_prob", + type=float, + default=0.1, + help="The attention probs dropout prob.") + # Other config + parser.add_argument( + "--seed", type=int, default=1234, help="Random seed for initialization") + parser.add_argument( + "--check_accuracy", + type=str2bool, + nargs='?', + const=False, + help="Check accuracy for training process.") + parser.add_argument( + "--device", + type=str, + default="gpu", + choices=["cpu", "gpu", "xpu"], + help="select cpu, gpu, xpu devices.") + parser.add_argument( + "--lr_decay_style", + type=str, + default="cosine", + choices=["cosine", "none"], + help="Learning rate decay style.") + parser.add_argument( + '-p', + '--profiler_options', + type=str, + default=None, + help='The option of profiler, which should be in format \"key1=value1;key2=value2;key3=value3\".' 
+ ) + args = parser.parse_args() + args.test_iters = args.eval_iters * 10 + + if args.check_accuracy: + if args.hidden_dropout_prob != 0: + args.hidden_dropout_prob = .0 + logger.warning( + "The hidden_dropout_prob should set to 0 for accuracy checking.") + if args.attention_probs_dropout_prob != 0: + args.attention_probs_dropout_prob = .0 + logger.warning( + "The attention_probs_dropout_prob should set to 0 for accuracy checking." + ) + + logger.info('{:20}:{}'.format("paddle commit id", paddle.version.commit)) + for arg in vars(args): + logger.info('{:20}:{}'.format(arg, getattr(args, arg))) + + return args diff --git a/examples/language_model/gpt-3/static/lr.py b/examples/language_model/gpt-3/static/lr.py deleted file mode 120000 index 09070bccaa33..000000000000 --- a/examples/language_model/gpt-3/static/lr.py +++ /dev/null @@ -1 +0,0 @@ -../../gpt/lr.py \ No newline at end of file diff --git a/examples/language_model/gpt-3/static/lr.py b/examples/language_model/gpt-3/static/lr.py new file mode 100644 index 000000000000..0736246910f0 --- /dev/null +++ b/examples/language_model/gpt-3/static/lr.py @@ -0,0 +1,49 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import math +import numpy +import warnings +from paddle import Tensor +from paddle.optimizer.lr import LRScheduler + + +class CosineAnnealingWithWarmupDecay(LRScheduler): + def __init__(self, + max_lr, + min_lr, + warmup_step, + decay_step, + last_epoch=0, + verbose=False): + + self.decay_step = decay_step + self.warmup_step = warmup_step + self.max_lr = max_lr + self.min_lr = min_lr + super(CosineAnnealingWithWarmupDecay, self).__init__(max_lr, last_epoch, + verbose) + + def get_lr(self): + if self.warmup_step > 0 and self.last_epoch <= self.warmup_step: + return float(self.max_lr) * (self.last_epoch) / self.warmup_step + + if self.last_epoch > self.decay_step: + return self.min_lr + + num_step_ = self.last_epoch - self.warmup_step + decay_step_ = self.decay_step - self.warmup_step + decay_ratio = float(num_step_) / float(decay_step_) + coeff = 0.5 * (math.cos(math.pi * decay_ratio) + 1.0) + return self.min_lr + coeff * (self.max_lr - self.min_lr)