Add model LUKE #1677

Beacontownfc · 2022-02-15T03:24:28Z

Description
Add the new model LUKE.
Model weights:
Link: https://pan.baidu.com/s/17aC-27kjJdEaGT6nZt5T_Q
Extraction code: i4p2

Steffy-zxf · 2022-02-21T08:09:50Z

examples/language_model/luke/create_squad_data.py

@@ -0,0 +1,30 @@
+#encoding=utf8


PaddleNLP currently supports Python 3.6 and above, so there is no need to declare the file encoding.

Steffy-zxf · 2022-02-21T08:09:58Z

examples/language_model/luke/run_open_entity.py

@@ -0,0 +1,237 @@
+# encoding=utf8


Steffy-zxf · 2022-02-21T08:11:43Z

examples/language_model/luke/run_open_entity.py

+
+parser = argparse.ArgumentParser(description="LUKE FOR OPEN ENTITY")
+
+parser.add_argument("--output_dir", type=str, required=True)


Please add a description (help text) for each argument.
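
For illustration, the request above could be met roughly as follows. This is a minimal sketch, not the code that was merged: the help wording for `--output_dir` is hypothetical, and `build_parser` is a name introduced only for this example. The `--data_dir` help string is the one that appears later in this review.

```python
import argparse


def build_parser():
    """Build a parser in which every option carries a help string."""
    parser = argparse.ArgumentParser(description="LUKE FOR OPEN ENTITY")
    # Hypothetical help wording, written for this sketch only.
    parser.add_argument(
        "--output_dir",
        type=str,
        required=True,
        help="Directory where checkpoints and predictions are written.")
    # Help string as it appears in a later revision quoted in this review.
    parser.add_argument(
        "--data_dir", type=str, required=True, help="Dataset folder")
    return parser
```

With help strings in place, `python run_open_entity.py --help` lists each option together with its description.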

Steffy-zxf · 2022-02-21T08:15:22Z

examples/language_model/luke/run_open_entity.py

+parser.add_argument("--max_mention_length", type=str, default=30)
+
+args = parser.parse_args()
+args.tokenizer = LukeTokenizer.from_pretrained(args.model_type)


Does the tokenizer have to be used as a global variable?

Steffy-zxf · 2022-02-21T08:19:32Z

examples/language_model/luke/run_open_entity.py

+            f.entity_attention_mask for f in features
+        ]
+        self.all_labels = [f.labels for f in features]
+


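How large is this dataset?
Loading the entire dataset at once is not recommended, because it can exhaust memory.
Instead, you can inherit from paddle.io.Dataset and return the data item by item.

There are two datasets: the Open Entity dataset is under 1 MB and the SQuAD1.1 dataset is under 30 MB.

Modified according to your suggestion.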
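
As a rough sketch of the lazy-loading idea discussed above: the class below implements only the `__getitem__`/`__len__` protocol that a map-style `paddle.io.Dataset` subclass provides, so it runs without Paddle installed. The class name `LazyJsonlDataset` and the one-JSON-record-per-line file layout are assumptions made for this example; the datasets in this PR are not stored that way.

```python
import json


class LazyJsonlDataset:
    """Map-style dataset that reads one record from disk per lookup.

    Stand-in for a paddle.io.Dataset subclass; in real code the class
    would inherit from paddle.io.Dataset and keep the same two methods.
    """

    def __init__(self, path):
        self.path = path
        self.offsets = []
        # Index the byte offset of every non-empty line once.
        # The records themselves stay on disk.
        with open(path, "rb") as f:
            pos = f.tell()
            line = f.readline()
            while line:
                if line.strip():
                    self.offsets.append(pos)
                pos = f.tell()
                line = f.readline()

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # Seek to the stored offset and parse only that record.
        with open(self.path, "rb") as f:
            f.seek(self.offsets[idx])
            return json.loads(f.readline().decode("utf-8"))
```

For datasets as small as the two mentioned in the reply, eager loading is unlikely to be a problem in practice; the sketch only shows what the on-demand variant looks like.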

Steffy-zxf · 2022-02-21T08:20:42Z

examples/language_model/luke/run_open_entity.py

+                    max_len - len(each_batch[k])))
+            return np.array(new_data, dtype='int64')
+
+        return (


Please add comments explaining what lines 189 - 196 do.
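
The referenced lines are not shown in full here. Judging only from the visible fragment, the code appears to right-pad every sequence in a batch to a common length. A commented sketch of that technique, under that assumption, might look like this. `pad_batch` and `pad_value` are hypothetical names, and the conversion to an `int64` NumPy array seen in the fragment is left out so the example has no dependencies.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    # The longest sequence decides the common length for this batch.
    max_len = max(len(seq) for seq in sequences)
    # Append pad_value until each sequence reaches max_len.
    return [
        list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences
    ]
```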

Steffy-zxf · 2022-02-21T08:21:17Z

examples/language_model/luke/run_squad.py

+
+class DataGenerator(Dataset):
+    def __init__(self, features, args):
+        super(DataGenerator, self).__init__()


Steffy-zxf · 2022-02-21T08:21:24Z

examples/language_model/luke/run_squad.py

@@ -0,0 +1,187 @@
+# encoding=utf8


Steffy-zxf · 2022-02-21T08:23:28Z

examples/language_model/luke/run_squad.py

+def load_examples(args, evaluate=False):
+    args.evaluate = evaluate
+    features = []
+    if not evaluate:


Suggest adding a data file parameter to control which dataset is loaded. That removes the need for the if-else check.
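
One possible shape for this suggestion, as a sketch only: the caller names the file, so the loader needs no `evaluate` flag and no branching. The `data_file` parameter is a hypothetical name, and reading the `"data"` key follows the SQuAD-style loading quoted elsewhere in this review.

```python
import json


def load_examples(data_file):
    """Load SQuAD-style articles from whichever file the caller names."""
    with open(data_file, encoding="utf-8") as reader:
        # SQuAD files keep the list of articles under the "data" key.
        return json.load(reader)["data"]
```

Training and evaluation then differ only in the path passed in, for example the train file in one call and the dev file in another.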

Steffy-zxf · 2022-02-21T08:25:39Z

examples/language_model/luke/utils/reading_comprehension/dataset.py

+            input_data = json.load(reader)["data"]
+        return self._create_examples(input_data)
+
+    # def __init__(self, qas_id, title, question_text, context_text, answers, is_impossible=False):


Remove the unused comment.

Steffy-zxf · 2022-02-24T07:37:30Z

examples/language_model/luke/run_open_entity.py

+parser.add_argument(
+    "--data_dir", type=str, required=True, help="Dataset folder")
+parser.add_argument(
+    "--eval_batch_size",


Suggest using a single batch_size; there is no need to distinguish between eval_batch_size and train_batch_size.

Steffy-zxf · 2022-02-24T07:43:08Z

examples/language_model/luke/utils/feature.py

+        all_mentions = mentions_a + mentions_b
+        if all_mentions:
+            print(all_mentions)
+            exit()


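What do lines 192 - 194 do?

This is the entity detection code provided by the authors of the LUKE model. We used their code to detect entities, but no entities were detected on the SQuAD1.1 dataset, so we have removed the entity detection code.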

Steffy-zxf · 2022-02-24T07:48:11Z

examples/language_model/luke/utils/squad_processor.py

+        min_mention_link_prob=args.min_mention_link_prob,
+        segment_b_id=0,
+        add_extra_sep_token=True,
+        is_training='train' in data_file)


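Is there room to simplify the parameters above? Please check whether any of them can be left to default values. The API currently takes too many parameters, which hurts readability.

The code has been optimized.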

Steffy-zxf · 2022-03-07T01:47:07Z

examples/language_model/luke/run_squad.py

+    # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
+    # in one example possible giving several features when a context is long, each of those features having a
+    # context that overlaps a bit the context of the previous feature.
+    #NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is


Steffy-zxf · 2022-03-07T01:48:17Z

examples/language_model/luke/run_squad.py

+    # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
+    # in one example possible giving several features when a context is long, each of those features having a
+    # context that overlaps a bit the context of the previous feature.
+    #NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is


Steffy-zxf · 2022-03-07T01:51:14Z

examples/language_model/luke/run_squad.py

+    set_seed(args)
+    if rank == 0:
+        if os.path.exists(args.model_name_or_path):
+            print("init checkpoint from %s" % args.model_name_or_path)


"Loads checkpoints from %s."

Steffy-zxf · 2022-03-07T01:51:55Z

examples/language_model/luke/run_squad.py

+                input_ids, token_type_ids, attention_mask, start_positions, end_positions = batch
+                logits = model(
+                    input_ids=input_ids,
+                    attention_mask=attention_mask, )


Remove the trailing comma.

Steffy-zxf · 2022-03-07T02:00:55Z

examples/language_model/luke/run_squad.py

+        dev_batchify_fn = lambda samples, fn=Dict({
+            "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id),
+            "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id),
+            "attention_mask": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), }): fn(samples)


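Does attention_mask really need to be passed in explicitly? BERT, for example, can derive the attention_mask from the pad token id.

It does not need to be specified; attention_mask has been removed.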
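
To make the reviewer's point concrete, here is a small sketch of deriving the mask from the padded ids. It is not a statement about how the LUKE model in this PR computes its mask internally; `attention_mask_from_ids` is a hypothetical helper, and the approach assumes the pad id is never used for a real token.

```python
def attention_mask_from_ids(input_ids, pad_token_id):
    """Return 1 for real tokens and 0 for padding positions."""
    return [[int(tok != pad_token_id) for tok in row] for row in input_ids]
```

Given a padded batch, positions holding `pad_token_id` are masked out and every other position is attended to.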

Steffy-zxf · 2022-03-07T02:12:11Z

examples/language_model/luke/trainer.py

@@ -0,0 +1,113 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.


What is this trainer.py for? It does not appear to be used anywhere.

It is used for training on the Open Entity dataset. I have merged it into run_open_entity.py.

Beacontownfc and others added 4 commits February 15, 2022 11:21

add luke

9b7f955

update example

df5981f

Update README.md

f7ec4f4

Update README.md

23fcc3f

chenxiaozeng requested a review from Steffy-zxf February 16, 2022 02:34

Beacontownfc and others added 5 commits February 17, 2022 20:08

normalized code

8e3c285

fix error

8d492e5

add doc string

ced44a6

Update modeling.py

0f729ef

delete entity_vocab.py

abd989c

ZeyuChen added the contributions label Feb 20, 2022

Steffy-zxf requested changes Feb 21, 2022

Beacontownfc added 3 commits February 21, 2022 23:33

Modify according to Steffy

3e2a455

fix format

ce88391

fix format

8bf00b9

Beacontownfc requested a review from Steffy-zxf February 22, 2022 02:02

Steffy-zxf requested changes Feb 24, 2022

update luke

5ff11c9

Beacontownfc requested a review from Steffy-zxf February 27, 2022 10:08

Beacontownfc added 2 commits February 28, 2022 08:59

Update modeling.py

01eb3c2

Update run_squad.py

308cf45

Steffy-zxf requested changes Mar 7, 2022

Beacontownfc added 2 commits March 7, 2022 13:00

update luke

a368ee3

update luke

2923237

Steffy-zxf approved these changes Mar 9, 2022

Merge branch 'develop' into luke

9f498c6

yingyibiao merged commit 30ab253 into PaddlePaddle:develop Mar 10, 2022

Beacontownfc deleted the luke branch March 10, 2022 23:51

Beacontownfc restored the luke branch March 10, 2022 23:51

guoshengCS mentioned this pull request Apr 29, 2022

PaddleNLP v2.3rc Release Note Candidate #2031

Closed

