Note
Currently, this repository contains only the Pap2Pat dataset. All code for the dataset creation, outline-guided generation and evaluation will be added later.
Dealing with long and highly complex technical text is a challenge for Large Language Models (LLMs), which have yet to realize their potential in supporting expensive and time-intensive processes like patent drafting. Within patents, the description constitutes more than 90% of the document on average. Yet, its automatic generation remains understudied. When drafting patent applications, patent attorneys typically receive invention reports (IRs), which are usually confidential, hindering research on LLM-supported patent drafting. Often, pre-publication research papers serve as IRs. We leverage this duality to build PAP2PAT, an open and realistic benchmark for patent drafting consisting of 1.8k patent-paper pairs describing the same inventions. To address the complex long-document patent generation task, we propose chunk-based outline-guided generation using the research paper as invention specification. Our extensive evaluation using PAP2PAT and a human case study show that LLMs can effectively leverage information from the paper, but still struggle to provide the necessary level of detail. Fine-tuning leads to more patent-style language, but also to more hallucination. We release our data and code.
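As a rough illustration of the idea behind chunk-based outline-guided generation, the sketch below splits an outline into chunks and generates the text for each chunk with the paper as context. It is a minimal sketch under stated assumptions, not the implementation from this repository: the function names, the prompt wording, the flat outline format, and the `generate` callable (a stand-in for any LLM call) are all hypothetical.

```python
from typing import Callable, List


def chunk_outline(headings: List[str], chunk_size: int) -> List[List[str]]:
    """Split a flat list of outline headings into consecutive chunks."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [headings[i:i + chunk_size] for i in range(0, len(headings), chunk_size)]


def build_prompt(paper_text: str, chunk: List[str]) -> str:
    """Assemble one prompt from the paper and one outline chunk.

    The prompt wording is an assumption made for illustration only.
    """
    outline = "\n".join(f"- {heading}" for heading in chunk)
    return (
        "Paper (invention specification):\n"
        f"{paper_text}\n\n"
        "Write the patent description for this part of the outline:\n"
        f"{outline}\n"
    )


def generate_description(
    paper_text: str,
    headings: List[str],
    generate: Callable[[str], str],
    chunk_size: int = 2,
) -> str:
    """Generate the description chunk by chunk and join the results.

    `generate` is a hypothetical stand-in for an LLM call: it takes a
    prompt string and returns the generated text.
    """
    parts = [
        generate(build_prompt(paper_text, chunk))
        for chunk in chunk_outline(headings, chunk_size)
    ]
    return "\n\n".join(parts)
```

Passing the model call in as a plain callable keeps the sketch independent of any particular LLM library; a stub such as `lambda prompt: "..."` is enough to exercise the chunking logic.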
This repository comprises three main parts:
- Pap2Pat: Dataset and evaluation code
- Outline_Guided_Generation: Implementation of chunk-based outline-guided generation
- Pap2Pat_Dataset_Creation: Code for the creation of Pap2Pat
The code is released under the MIT license, see LICENSE.
The data is released under CC-BY.