This tutorial uses PaddleNLP's Universal Information Extraction (UIE) to perform a named entity recognition task, and shows how annotating a small amount of data and fine-tuning the model can quickly improve its performance.
The complete Jupyter notebook, code, and annotated data can all be found at https://hyperai.com/console/open-tutorials/containers/lWyxi1DwhJU.
Import dependencies
from pprint import pprint
from paddlenlp import Taskflow
Use uie-base for named entity recognition
First, run named entity recognition directly with the pre-trained uie-base model, without any fine-tuning, to see how it performs.
schema = [
'place name',
'name',
'organization',
'time',
'product',
'Price',
'weather'
]
ie = Taskflow('information_extraction', schema=schema)
pprint(ie("2K With Gearbox Software announce, 《Little Tina's Wonderland》Will be 6 month 24 Early morning of the day 1 Click login Steam, before PC The platform is Epic Limited time exclusive. Within a limited period of time. Steam Players can Steam start with《Little Tina's Wonderland》, And in 2022 year 7 month 8 Recently, you can enjoy the Gold Hero Armor Pack."))
[{'product': [{'end': 35,
'probability': 0.8594067882980987,
'start': 25,
'text': '《Little Tina's Wonderland》'}],
'place name': [{'end': 117,
'probability': 0.5248250992968906,
'start': 109,
'text': 'Little Tina's Wonderland'},
{'end': 34,
'probability': 0.3007929716932729,
'start': 26,
'text': 'Little Tina's Wonderland'}],
'time': [{'end': 52,
'probability': 0.87968346213556,
'start': 38,
           'text': '1 a.m. on June 24'}],
'organization': [{'end': 93,
'probability': 0.5977969768231866,
'start': 88,
'text': 'Steam'},
{'end': 2,
'probability': 0.6914769673274321,
'start': 0,
'text': '2K'},
{'end': 75,
'probability': 0.5848915911412256,
'start': 71,
'text': 'Epic'},
{'end': 60,
'probability': 0.5682100157587833,
'start': 55,
'text': 'Steam'},
{'end': 21,
'probability': 0.679590305138845,
'start': 5,
'text': 'Gearbox Software'},
{'end': 105,
'probability': 0.4573145431744834,
'start': 100,
'text': 'Steam'}]}]
pprint(ie("recently. Quantum computing expert. ACM Winner of the Computing Award Scott Aaronson Announced through a blog post. I will be leaving the University of Texas at Austin this week (UT Austin) a year. And join an artificial intelligence research company OpenAI."))
[{'name': [{'end': 32,
'probability': 0.4801083732026494,
'start': 24,
'text': 'Aaronson'},
{'end': 23,
'probability': 0.6648137293130958,
'start': 18,
'text': 'Scott'}],
'time': [{'end': 43,
'probability': 0.8425767345737043,
'start': 41,
'text': 'This week'}],
'organization': [{'end': 87,
'probability': 0.5554367836811132,
'start': 81,
'text': 'OpenAI'}]}]
Using the default uie-base model for named entity recognition already works reasonably well: most named entities are identified. However, some entities are still missed and some text is partially misidentified. For example, "Scott Aaronson" is recognized as two separate person names, and "University of Texas at Austin" is not recognized at all.
To improve recognition performance, this tutorial fine-tunes the model with a small amount of annotated data.
Data Annotations
This tutorial uses the data annotation platform Label Studio to annotate the data. All of this work is done inside an open "HyperAI workspace".
Start Label Studio
As shown above, open a terminal in Jupyter and run HyperAI-label-studio in it.
This makes Label Studio available in the HyperAI Jupyter workspace. Then open the URL generated on the command line, as shown below, to start Label Studio:
Open the link in a browser, register an account, and log in; you can then start using it.
Note that the external access link (in the red box) is different for each HyperAI compute container, so the link shown in this tutorial will not work directly; replace it with the link printed in your own terminal.
Annotate data
The specific steps are as follows:
- Create a project.
- Import data. The data used in this tutorial has already been uploaded to this compute container as corpus.txt.
- Configure the labeling interface. Under Natural Language Processing, choose the Named Entity Recognition template and add or modify labels as needed. The entity labels defined in this tutorial are 'place name', 'name', 'organization', 'time', 'product', 'Price', and 'weather'.
- Start annotating the data.
- Export the data. After annotation is finished, export the results from Label Studio as a JSON file. A pre-labeled file, label-studio.json, is already included in this compute container.
If you would rather not annotate the data yourself, that is fine: label-studio.json in this tutorial already contains the annotated, exported results (the snippet below gives a quick look at its structure).
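Before converting the data, it can help to take a quick look at the exported file. The snippet below is a minimal sketch that assumes the usual Label Studio JSON export layout (a list of tasks, each with data.text and annotations[0].result spans); adjust the field names if your export differs.
import json

# Quick sanity check on the Label Studio export (field names assume the
# standard Label Studio JSON layout and may differ in your export).
with open("label-studio.json", encoding="utf-8") as f:
    tasks = json.load(f)

print(f"{len(tasks)} annotated samples")
first = tasks[0]
print(first["data"]["text"][:50])                      # the raw sentence
for span in first["annotations"][0]["result"]:
    value = span["value"]
    print(value["labels"], value["start"], value["end"], value["text"])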
Model fine-tuning
The scripts required for the model fine-tuning below have already been uploaded to this compute container.
Data conversion
Run the following script in the terminal to convert the Label Studio export file into the doccano export format.
python labelstudio2doccano.py --labelstudio_file label-studio.json
Parameter Description:
- labelstudio_file: path of the Label Studio export file (only JSON format is supported).
- doccano_file: save path of the doccano-format data file. Default is "doccano_ext.jsonl".
- task_type: task type; extraction ("ext") and classification ("cls") tasks are supported. Default is "ext".
PaddleNLP does not ship a tool for converting Label Studio annotations into the format it supports, so a labelstudio2doccano.py script is provided here.
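For reference, the core of such a conversion looks roughly like the sketch below: each Label Studio task becomes one JSON line with the text, entities (start/end offsets and label), and relations keys used by doccano-style extraction data. This is a simplified illustration under the same field-name assumptions as above, not the bundled script itself.
import json

# Simplified sketch of a Label Studio -> doccano conversion; the bundled
# labelstudio2doccano.py handles more cases and options.
def labelstudio_to_doccano(labelstudio_file, doccano_file="doccano_ext.jsonl"):
    with open(labelstudio_file, encoding="utf-8") as f:
        tasks = json.load(f)
    with open(doccano_file, "w", encoding="utf-8") as out:
        for i, task in enumerate(tasks):
            entities = []
            for j, span in enumerate(task["annotations"][0]["result"]):
                value = span["value"]
                entities.append({
                    "id": j,
                    "start_offset": value["start"],
                    "end_offset": value["end"],
                    "label": value["labels"][0],
                })
            record = {"id": i, "text": task["data"]["text"],
                      "entities": entities, "relations": []}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")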
Then run the following script in the terminal to process the doccano-format data file. After it finishes, the training/validation/test set files are generated under the /home/data directory.
python doccano.py \
--doccano_file ./doccano_ext.jsonl \
--task_type "ext" \
--save_dir ./data \
--splits 0.7 0.2 0.1
Parameter Description:
- doccano_file: path of the doccano-format annotation file.
- task_type: task type; extraction ("ext") and classification ("cls") tasks are supported.
- save_dir: directory where the training data is stored. Stored under the data directory by default.
- negative_ratio: maximum ratio of negative examples; only valid for extraction tasks. Properly constructed negative examples can improve model performance. The number of negative examples is related to the actual number of labels: maximum number of negatives = negative_ratio * number of positives. This parameter only applies to the training set; to keep the evaluation metrics accurate, all negative examples are constructed for the validation and test sets by default. Default is 5.
- splits: proportions for splitting the dataset into training, validation, and test sets. Default is [0.8, 0.1, 0.1].
- options: category labels for the classification task; only valid for classification tasks. Default is ["positive", "negative"].
- prompt_prefix: prompt prefix for the classification task; only valid for classification tasks. Default is "sentiment tendency".
- is_shuffle: whether to shuffle the dataset before splitting. Default is True.
- seed: random seed. Default is 1000.
- separator: separator between the entity category/evaluation dimension and the classification label; only valid for entity-/aspect-level classification tasks. Default is "##".
Each run of the doccano.py script overwrites existing data files with the same name.
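To sanity-check the conversion, you can peek at the first sample of the generated training file. The snippet below assumes the usual UIE training layout of one JSON object per line with content, result_list, and prompt fields; the exact layout is whatever doccano.py produces.
import json

# Peek at the first generated training example (quick sanity check).
with open("./data/train.txt", encoding="utf-8") as f:
    sample = json.loads(f.readline())

print(sample.get("prompt"))       # e.g. one of the entity labels
print(sample.get("content"))      # the raw sentence
print(sample.get("result_list"))  # gold spans for this prompt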
Finetune
Run the following script in the terminal to fine-tune the model.
python finetune.py \
--train_path "./data/train.txt" \
--dev_path "./data/dev.txt" \
--save_dir "./checkpoint" \
--learning_rate 1e-5 \
--batch_size 4 \
--max_seq_len 512 \
--num_epochs 50 \
--model "uie-base" \
--seed 1000 \
--logging_steps 10 \
--valid_steps 100 \
--device "gpu"
Parameter Description:
- train_path: path of the training set file.
- dev_path: path of the validation set file.
- save_dir: directory where model checkpoints are saved. Default is "./checkpoint".
- learning_rate: learning rate. Default is 1e-5.
- batch_size: batch size; adjust it to your machine. Default is 16.
- max_seq_len: maximum text length; longer inputs are split automatically. Default is 512.
- num_epochs: number of training epochs. Default is 100.
- model: the model to fine-tune; one of "uie-base", "uie-medium", "uie-mini", "uie-micro", and "uie-nano". Default is "uie-base".
- seed: random seed. Default is 1000.
- logging_steps: number of steps between log outputs. Default is 10.
- valid_steps: number of steps between evaluations. Default is 100.
- device: device to train on; "cpu" or "gpu".
- init_from_ckpt: path of model parameters to initialize from, which allows training to resume from a checkpoint.
Model evaluation
Run the following script in the terminal to evaluate the model.
python evaluate.py \
--model_path ./checkpoint/model_best \
--test_path ./data/dev.txt \
--batch_size 16 \
--max_seq_len 512
output:
[2022-07-15 03:18:19,157] [ INFO] - -----------------------------
[2022-07-15 03:18:19,157] [ INFO] - Class Name: all_classes
[2022-07-15 03:18:19,157] [ INFO] - Evaluation Precision: 0.95349 | Recall: 0.89130 | F1: 0.92135
As you can see, F1 has already reached 0.92; the fine-tuned model performs well.
Parameter Description:
- model_path: path of the model folder to evaluate; it must contain the model weight file model_state.pdparams and the configuration file model_config.json.
- test_path: test set file used for evaluation.
- batch_size: batch size; adjust it to your machine. Default is 16.
- max_seq_len: maximum text length; longer inputs are split automatically. Default is 512.
- debug: whether to enable debug mode, which evaluates each positive-example category separately. This mode is only used for model debugging. Disabled by default.
Example output with debug mode enabled:
[2022-07-15 03:27:57,801] [ INFO] - -----------------------------
[2022-07-15 03:27:57,801] [ INFO] - Class Name: organization
[2022-07-15 03:27:57,802] [ INFO] - Evaluation Precision: 1.00000 | Recall: 0.75000 | F1: 0.85714
[2022-07-15 03:27:57,913] [ INFO] - -----------------------------
[2022-07-15 03:27:57,913] [ INFO] - Class Name: place name
[2022-07-15 03:27:57,913] [ INFO] - Evaluation Precision: 0.90476 | Recall: 0.82609 | F1: 0.86364
[2022-07-15 03:27:58,046] [ INFO] - -----------------------------
[2022-07-15 03:27:58,046] [ INFO] - Class Name: time
[2022-07-15 03:27:58,047] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2022-07-15 03:27:58,098] [ INFO] - -----------------------------
[2022-07-15 03:27:58,098] [ INFO] - Class Name: product
[2022-07-15 03:27:58,098] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2022-07-15 03:27:58,147] [ INFO] - -----------------------------
[2022-07-15 03:27:58,147] [ INFO] - Class Name: Price
[2022-07-15 03:27:58,147] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2022-07-15 03:27:58,176] [ INFO] - -----------------------------
[2022-07-15 03:27:58,176] [ INFO] - Class Name: name
[2022-07-15 03:27:58,177] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
Effect after fine-tuning
my_ie = Taskflow("information_extraction", schema=schema, task_path='./checkpoint/model_best')  # task_path specifies the path of the fine-tuned model weights
pprint(my_ie("2K With Gearbox Software announce, 《Little Tina's Wonderland》Will be 6 month 24 Early morning of the day 1 Click login Steam, before PC The platform is Epic Limited time exclusive. Within a limited period of time. Steam Players can Steam start with《Little Tina's Wonderland》, And in 2022 year 7 month 8 Recently, you can enjoy the Gold Hero Armor Pack."))
[{'product': [{'end': 148,
'probability': 0.9977381891196586,
'start': 141,
'text': 'Golden Hero Armor Pack'}],
'time': [{'end': 52,
'probability': 0.9999856949362851,
'start': 38,
          'text': '1 a.m. on June 24'},
{'end': 137,
'probability': 0.6508416072546055,
'start': 122,
          'text': 'before July 8, 2022'}],
'organization': [{'end': 21,
'probability': 0.9996073012678011,
'start': 5,
'text': 'Gearbox Software'},
{'end': 93,
'probability': 0.9872895891306825,
'start': 88,
'text': 'Steam'},
{'end': 105,
'probability': 0.9665188911951077,
'start': 100,
'text': 'Steam'},
{'end': 2,
'probability': 0.9883892925330713,
'start': 0,
'text': '2K'},
{'end': 75,
'probability': 0.9965524822425209,
'start': 71,
'text': 'Epic'},
{'end': 60,
'probability': 0.9965759490955008,
'start': 55,
'text': 'Steam'}]}]
pprint(my_ie("recently. Quantum computing expert. ACM Winner of the Computing Award Scott Aaronson Announced through a blog post. I will be leaving the University of Texas at Austin this week (UT Austin) a year. And join an artificial intelligence research company OpenAI."))
[{'name': [{'end': 32,
'probability': 0.9999316942434575,
'start': 18,
'text': 'Scott Aaronson'}],
'place name': [{'end': 54,
'probability': 0.976469583224933,
'start': 51,
'text': 'Austin'}],
'time': [{'end': 69,
'probability': 0.9782005099896942,
'start': 67,
'text': 'a year'},
{'end': 2,
'probability': 0.9995077236474508,
'start': 0,
'text': 'recently'},
{'end': 43,
'probability': 0.9999382505043286,
'start': 41,
'text': 'This week'}],
'organization': [{'end': 66,
'probability': 0.46570937436359827,
'start': 57,
'text': 'UT Austin'},
{'end': 56,
'probability': 0.9686587700987381,
'start': 45,
'text': 'University of Texas at Austin '},
{'end': 13,
'probability': 0.7166219551892539,
'start': 10,
'text': 'ACM'},
{'end': 87,
'probability': 0.999835617128781,
'start': 81,
'text': 'OpenAI'}]}]
Model deployment
After obtaining the fine-tuned model, you can deploy it to a HyperAI server to provide a real-time model inference service.
For more information about model deployment, see "Introduction to Model Deployment" and "Model Deployment for Chinese Named Entity Recognition Based on Transfer Learning".
Writing the serving service
Write the predictor.py file:
- Import dependencies: in addition to the libraries needed by the business logic, the extra dependency HyperAI-serving is required.
import HyperAI_serving as serv
from paddlenlp import Taskflow
- Post-processing (optional): process the results returned by the model as needed. For a clearer presentation, this tutorial reshapes the named entity recognition results with the format() and add_o() functions (a sketch of what these helpers might look like is shown below).
- Predictor class: it does not need to inherit from any other class, but it must provide at least the __init__ and predict interfaces.
  - In __init__, define the entity extraction schema and load the model with Taskflow.
  - In predict, run the prediction and return the post-processed result.
class Predictor:
    def __init__(self):
        # Define the extraction schema and load the fine-tuned model weights
        self.schema = ['place name', 'name', 'organization', 'time', 'product', 'Price', 'weather']
        self.ie = Taskflow("information_extraction", schema=self.schema, task_path='./checkpoint/model_best')

    def predict(self, json):
        text = json["input"]
        uie = self.ie(text)[0]
        # format() is the post-processing helper defined in predictor.py
        result = format(text, uie)
        return result
- Main function: start the service.
if __name__ == '__main__':
serv.run(Predictor)
A pre-written predictor.py is already available in the root directory of this tutorial and can be used directly.
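For reference, a minimal sketch of what the format() and add_o() post-processing could look like is shown below. This is not the exact implementation shipped in predictor.py, and the entity_group label mapping is an assumption inferred from the sample output further down.
# Hypothetical post-processing helpers; the real ones live in predictor.py.
LABEL_MAP = {
    'place name': 'LOC', 'name': 'PER', 'organization': 'ORG',
    'time': 'TIME', 'product': 'PROD', 'Price': 'PRICE', 'weather': 'WEATHER',
}

def format(text, uie):
    """Flatten UIE output into a list of entity records sorted by position."""
    entities = []
    for label, spans in uie.items():
        for span in spans:
            entities.append({'entity_group': LABEL_MAP.get(label, label),
                             'score': span['probability'],
                             'start': span['start'], 'end': span['end'],
                             'word': span['text']})
    entities.sort(key=lambda e: (e['start'], e['end']))
    return add_o(entities, text)

def add_o(entities, text):
    """Insert 'O' segments for the text between recognized entities."""
    result, cursor = [], 0
    for ent in entities:
        if ent['start'] > cursor:
            result.append({'entity_group': 'O', 'score': None,
                           'start': cursor, 'end': ent['start'],
                           'word': text[cursor:ent['start']]})
        result.append(ent)
        cursor = max(cursor, ent['end'])
    if cursor < len(text):
        result.append({'entity_group': 'O', 'score': None,
                       'start': cursor, 'end': len(text),
                       'word': text[cursor:]})
    return result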
Testing in Jupyter
Run OPENBAYES_JOB_URL= python predictor.py in the terminal. Once the local test service has started successfully, run the following code in the Notebook to test it.
import requests
text = {
"input": "recently. Quantum computing expert. ACM Winner of the Computing Award Scott Aaronson Announced through a blog post. I will be leaving the University of Texas at Austin this week (UT Austin) a year. And join an artificial intelligence research company OpenAI."
}
result = requests.post('http://localhost:25252', json=text)
result.json()
[{'entity_group': 'TIME',
'score': 0.9995077236474508,
'start': 0,
'end': 2,
'word': 'recently'},
{'entity_group': 'O',
'score': None,
'start': 2,
'end': 10,
  'word': ', quantum computing expert and '},
{'entity_group': 'ORG',
'score': 0.7166219551892539,
'start': 10,
'end': 13,
'word': 'ACM'},
 {'entity_group': 'O', 'score': None, 'start': 13, 'end': 18, 'word': 'Prize in Computing winner'},
{'entity_group': 'PER',
'score': 0.9999316942434575,
'start': 18,
'end': 32,
'word': 'Scott Aaronson'},
{'entity_group': 'O',
'score': None,
'start': 32,
'end': 41,
  'word': 'announced in a blog post that'},
{'entity_group': 'TIME',
'score': 0.9999382505043286,
'start': 41,
'end': 43,
'word': 'This week'},
 {'entity_group': 'O', 'score': None, 'start': 43, 'end': 45, 'word': 'he will leave the'},
{'entity_group': 'ORG',
'score': 0.9686587700987381,
'start': 45,
'end': 56,
'word': 'University of Texas at Austin '},
{'entity_group': 'LOC',
'score': 0.976469583224933,
'start': 51,
'end': 54,
'word': 'Austin'},
{'entity_group': 'O', 'score': None, 'start': 56, 'end': 57, 'word': '('},
{'entity_group': 'ORG',
'score': 0.46570937436359827,
'start': 57,
'end': 66,
'word': 'UT Austin'},
{'entity_group': 'O', 'score': None, 'start': 66, 'end': 67, 'word': ')'},
{'entity_group': 'TIME',
'score': 0.9782005099896942,
'start': 67,
'end': 69,
'word': 'a year'},
{'entity_group': 'O',
'score': None,
'start': 69,
'end': 81,
  'word': 'and join the artificial intelligence research company'},
{'entity_group': 'ORG',
'score': 0.999835617128781,
'start': 81,
'end': 87,
'word': 'OpenAI'},
{'entity_group': 'O', 'score': None, 'start': 87, 'end': 88, 'word': '.'}]
Deploy
After the test succeeds, stop this compute container and wait for data synchronization to complete.
In 'Compute Container' - 'Model Deployment', click 'Create New Deployment', choose the same image used during development, bind this compute container, and click 'Deploy'. You can then test the service online.
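Once the deployment is up, it can be called in the same way as the local test, only with the deployment's serving URL instead of localhost. The URL below is a placeholder; use the address shown on your deployment page.
import requests

SERVING_URL = "https://<your-deployment-address>"  # placeholder: copy from the deployment page

text = {"input": "Scott Aaronson will leave UT Austin this week and join OpenAI."}
result = requests.post(SERVING_URL, json=text)
print(result.json())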