This tutorial uses PaddleNLP's Universal Information Extraction (UIE) to perform a named entity recognition task, and shows how annotating a small amount of data and fine-tuning the model can quickly improve its performance.
The complete Jupyter notebook, code, and annotated data can all be found at https://hyperai.com/console/open-tutorials/containers/lWyxi1DwhJU.
Import dependencies
from pprint import pprint
from paddlenlp import Taskflow
Use uie-base for named entity recognition
First, run named entity recognition directly with the pre-trained uie-base model, without any fine-tuning, to see how it performs.
schema = [
'place name',
'name',
'organization',
'time',
'product',
'Price',
'weather'
]
ie = Taskflow('information_extraction', schema=schema)
pprint(ie("2K With Gearbox Software announce, 《Little Tina's Wonderland》Will be 6 month 24 Early morning of the day 1 Click login Steam, before PC The platform is Epic Limited time exclusive. Within a limited period of time. Steam Players can Steam start with《Little Tina's Wonderland》, And in 2022 year 7 month 8 Recently, you can enjoy the Gold Hero Armor Pack."))
[{'product': [{'end': 35,
'probability': 0.8594067882980987,
'start': 25,
'text': '《Little Tina's Wonderland》'}],
'place name': [{'end': 117,
'probability': 0.5248250992968906,
'start': 109,
'text': 'Little Tina's Wonderland'},
{'end': 34,
'probability': 0.3007929716932729,
'start': 26,
'text': 'Little Tina's Wonderland'}],
'time': [{'end': 52,
'probability': 0.87968346213556,
'start': 38,
           'text': '1 a.m. on June 24'}],
'organization': [{'end': 93,
'probability': 0.5977969768231866,
'start': 88,
'text': 'Steam'},
{'end': 2,
'probability': 0.6914769673274321,
'start': 0,
'text': '2K'},
{'end': 75,
'probability': 0.5848915911412256,
'start': 71,
'text': 'Epic'},
{'end': 60,
'probability': 0.5682100157587833,
'start': 55,
'text': 'Steam'},
{'end': 21,
'probability': 0.679590305138845,
'start': 5,
'text': 'Gearbox Software'},
{'end': 105,
'probability': 0.4573145431744834,
'start': 100,
'text': 'Steam'}]}]
pprint(ie("recently. Quantum computing expert. ACM Winner of the Computing Award Scott Aaronson Announced through a blog post. I will be leaving the University of Texas at Austin this week (UT Austin) a year. And join an artificial intelligence research company OpenAI."))
[{'name': [{'end': 32,
'probability': 0.4801083732026494,
'start': 24,
'text': 'Aaronson'},
{'end': 23,
'probability': 0.6648137293130958,
'start': 18,
'text': 'Scott'}],
'time': [{'end': 43,
'probability': 0.8425767345737043,
'start': 41,
'text': 'This week'}],
'organization': [{'end': 87,
'probability': 0.5554367836811132,
'start': 81,
'text': 'OpenAI'}]}]
Using the default uie-base model for named entity recognition already works reasonably well: most named entities are identified. However, some entities are still missed and some text is partially misidentified. For example, "Scott Aaronson" is recognized as two separate person names, and "University of Texas at Austin" is not recognized at all.
To improve recognition performance, this tutorial fine-tunes the model with a small amount of annotated data.
Data Annotations
This tutorial uses the data annotation platform Label Studio to annotate the data. All of this work is done inside an open "HyperAI workspace".
Start Label Studio
As shown above, open a terminal in Jupyter and run HyperAI-label-studio in it.
This makes Label Studio available in the HyperAI Jupyter workspace. Then open the URL generated on the command line, as shown below, to start Label Studio:
Open the link in a browser, register an account, and log in; you can then start using it.
Note that the external access link (in the red box) is different for each HyperAI compute container, so the link shown in this tutorial will not work directly; replace it with the link printed in your own terminal.
Annotate data
The specific steps are as follows:
- Create a project.
- Import data. The data used in this tutorial has already been uploaded to this compute container as corpus.txt.
- Configure the labeling interface. Under Natural Language Processing, choose the Named Entity Recognition template and add or modify labels as needed. The entity labels defined in this tutorial are 'place name', 'name', 'organization', 'time', 'product', 'Price', and 'weather'.
- Start annotating the data.
- Export the data. After annotation is finished, export the results from Label Studio as a JSON file. A pre-labeled file, label-studio.json, is already included in this compute container.
If you would rather not annotate the data yourself, that is fine: label-studio.json in this tutorial already contains the annotated, exported results (the snippet below gives a quick look at its structure).
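Before converting the data, it can help to take a quick look at the exported file. The snippet below is a minimal sketch that assumes the usual Label Studio JSON export layout (a list of tasks, each with data.text and annotations[0].result spans); adjust the field names if your export differs.
import json

# Quick sanity check on the Label Studio export (field names assume the
# standard Label Studio JSON layout and may differ in your export).
with open("label-studio.json", encoding="utf-8") as f:
    tasks = json.load(f)

print(f"{len(tasks)} annotated samples")
first = tasks[0]
print(first["data"]["text"][:50])                      # the raw sentence
for span in first["annotations"][0]["result"]:
    value = span["value"]
    print(value["labels"], value["start"], value["end"], value["text"])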
Model fine-tuning
The scripts required for the model fine-tuning below have already been uploaded to this compute container.
Data conversion
Run the following script in the terminal to convert the Label Studio export file into the doccano export format.
python labelstudio2doccano.py --labelstudio_file label-studio.json
Parameter Description:
- labelstudio_file: path of the Label Studio export file (only JSON format is supported).
- doccano_file: save path of the doccano-format data file. Default is "doccano_ext.jsonl".
- task_type: task type; extraction ("ext") and classification ("cls") tasks are supported. Default is "ext".
PaddleNLP does not ship a tool for converting Label Studio annotations into the format it supports, so a labelstudio2doccano.py script is provided here.
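For reference, the core of such a conversion looks roughly like the sketch below: each Label Studio task becomes one JSON line with the text, entities (start/end offsets and label), and relations keys used by doccano-style extraction data. This is a simplified illustration under the same field-name assumptions as above, not the bundled script itself.
import json

# Simplified sketch of a Label Studio -> doccano conversion; the bundled
# labelstudio2doccano.py handles more cases and options.
def labelstudio_to_doccano(labelstudio_file, doccano_file="doccano_ext.jsonl"):
    with open(labelstudio_file, encoding="utf-8") as f:
        tasks = json.load(f)
    with open(doccano_file, "w", encoding="utf-8") as out:
        for i, task in enumerate(tasks):
            entities = []
            for j, span in enumerate(task["annotations"][0]["result"]):
                value = span["value"]
                entities.append({
                    "id": j,
                    "start_offset": value["start"],
                    "end_offset": value["end"],
                    "label": value["labels"][0],
                })
            record = {"id": i, "text": task["data"]["text"],
                      "entities": entities, "relations": []}
            out.write(json.dumps(record, ensure_ascii=False) + "\n")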
Then run the following script in the terminal to process the doccano-format data file. After it finishes, the training/validation/test set files are generated under the /home/data directory.
python doccano.py \
--doccano_file ./doccano_ext.jsonl \
--task_type "ext" \
--save_dir ./data \
--splits 0.7 0.2 0.1
Parameter Description:
- doccano_file: path of the doccano-format annotation file.
- task_type: task type; extraction ("ext") and classification ("cls") tasks are supported.
- save_dir: directory where the training data is stored. Stored under the data directory by default.
- negative_ratio: maximum ratio of negative examples; only valid for extraction tasks. Properly constructed negative examples can improve model performance. The number of negative examples is related to the actual number of labels: maximum number of negatives = negative_ratio * number of positives. This parameter only applies to the training set; to keep the evaluation metrics accurate, all negative examples are constructed for the validation and test sets by default. Default is 5.
- splits: proportions for splitting the dataset into training, validation, and test sets. Default is [0.8, 0.1, 0.1].
- options: category labels for the classification task; only valid for classification tasks. Default is ["positive", "negative"].
- prompt_prefix: prompt prefix for the classification task; only valid for classification tasks. Default is "sentiment tendency".
- is_shuffle: whether to shuffle the dataset before splitting. Default is True.
- seed: random seed. Default is 1000.
- separator: separator between the entity category/evaluation dimension and the classification label; only valid for entity-/aspect-level classification tasks. Default is "##".
Each run of the doccano.py script overwrites existing data files with the same name.
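To sanity-check the conversion, you can peek at the first sample of the generated training file. The snippet below assumes the usual UIE training layout of one JSON object per line with content, result_list, and prompt fields; the exact layout is whatever doccano.py produces.
import json

# Peek at the first generated training example (quick sanity check).
with open("./data/train.txt", encoding="utf-8") as f:
    sample = json.loads(f.readline())

print(sample.get("prompt"))       # e.g. one of the entity labels
print(sample.get("content"))      # the raw sentence
print(sample.get("result_list"))  # gold spans for this prompt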
Finetune
Run the following script in the terminal to fine-tune the model.
python finetune.py \
--train_path "./data/train.txt" \
--dev_path "./data/dev.txt" \
--save_dir "./checkpoint" \
--learning_rate 1e-5 \
--batch_size 4 \
--max_seq_len 512 \
--num_epochs 50 \
--model "uie-base" \
--seed 1000 \
--logging_steps 10 \
--valid_steps 100 \
--device "gpu"
Parameter Description:
- train_path: path of the training set file.
- dev_path: path of the validation set file.
- save_dir: directory where model checkpoints are saved. Default is "./checkpoint".
- learning_rate: learning rate. Default is 1e-5.
- batch_size: batch size; adjust it to your machine. Default is 16.
- max_seq_len: maximum text length; longer inputs are split automatically. Default is 512.
- num_epochs: number of training epochs. Default is 100.
- model: the model to fine-tune; one of "uie-base", "uie-medium", "uie-mini", "uie-micro", and "uie-nano". Default is "uie-base".
- seed: random seed. Default is 1000.
- logging_steps: number of steps between log outputs. Default is 10.
- valid_steps: number of steps between evaluations. Default is 100.
- device: device to train on; "cpu" or "gpu".
- init_from_ckpt: path of model parameters to initialize from, which allows training to resume from a checkpoint.
Model evaluation
Run the following script in the terminal to evaluate the model.
python evaluate.py \
--model_path ./checkpoint/model_best \
--test_path ./data/dev.txt \
--batch_size 16 \
--max_seq_len 512
output:
[2022-07-15 03:18:19,157] [ INFO] - -----------------------------
[2022-07-15 03:18:19,157] [ INFO] - Class Name: all_classes
[2022-07-15 03:18:19,157] [ INFO] - Evaluation Precision: 0.95349 | Recall: 0.89130 | F1: 0.92135
As you can see, F1 has already reached 0.92; the fine-tuned model performs well.
Parameter Description:
- model_path: path of the model folder to evaluate; it must contain the model weight file model_state.pdparams and the configuration file model_config.json.
- test_path: test set file used for evaluation.
- batch_size: batch size; adjust it to your machine. Default is 16.
- max_seq_len: maximum text length; longer inputs are split automatically. Default is 512.
- debug: whether to enable debug mode, which evaluates each positive-example category separately. This mode is only used for model debugging. Disabled by default.
Example output with debug mode enabled:
[2022-07-15 03:27:57,801] [ INFO] - -----------------------------
[2022-07-15 03:27:57,801] [ INFO] - Class Name: organization
[2022-07-15 03:27:57,802] [ INFO] - Evaluation Precision: 1.00000 | Recall: 0.75000 | F1: 0.85714
[2022-07-15 03:27:57,913] [ INFO] - -----------------------------
[2022-07-15 03:27:57,913] [ INFO] - Class Name: place name
[2022-07-15 03:27:57,913] [ INFO] - Evaluation Precision: 0.90476 | Recall: 0.82609 | F1: 0.86364
[2022-07-15 03:27:58,046] [ INFO] - -----------------------------
[2022-07-15 03:27:58,046] [ INFO] - Class Name: time
[2022-07-15 03:27:58,047] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2022-07-15 03:27:58,098] [ INFO] - -----------------------------
[2022-07-15 03:27:58,098] [ INFO] - Class Name: product
[2022-07-15 03:27:58,098] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2022-07-15 03:27:58,147] [ INFO] - -----------------------------
[2022-07-15 03:27:58,147] [ INFO] - Class Name: Price
[2022-07-15 03:27:58,147] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
[2022-07-15 03:27:58,176] [ INFO] - -----------------------------
[2022-07-15 03:27:58,176] [ INFO] - Class Name: name
[2022-07-15 03:27:58,177] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
Effect after fine-tuning
my_ie = Taskflow("information_extraction", schema=schema, task_path='./checkpoint/model_best')  # task_path specifies the path of the fine-tuned model weights
pprint(my_ie("2K With Gearbox Software announce, 《Little Tina's Wonderland》Will be 6 month 24 Early morning of the day 1 Click login Steam, before PC The platform is Epic Limited time exclusive. Within a limited period of time. Steam Players can Steam start with《Little Tina's Wonderland》, And in 2022 year 7 month 8 Recently, you can enjoy the Gold Hero Armor Pack."))
[{'product': [{'end': 148,
'probability': 0.9977381891196586,
'start': 141,
'text': 'Golden Hero Armor Pack'}],
'time': [{'end': 52,
'probability': 0.9999856949362851,
'start': 38,
          'text': '1 a.m. on June 24'},
{'end': 137,
'probability': 0.6508416072546055,
'start': 122,
          'text': 'before July 8, 2022'}],
'organization': [{'end': 21,
'probability': 0.9996073012678011,
'start': 5,
'text': 'Gearbox Software'},
{'end': 93,
'probability': 0.9872895891306825,
'start': 88,
'text': 'Steam'},
{'end': 105,
'probability': 0.9665188911951077,
'start': 100,
'text': 'Steam'},
{'end': 2,
'probability': 0.9883892925330713,
'start': 0,
'text': '2K'},
{'end': 75,
'probability': 0.9965524822425209,
'start': 71,
'text': 'Epic'},
{'end': 60,
'probability': 0.9965759490955008,
'start': 55,
'text': 'Steam'}]}]
pprint(my_ie("recently. Quantum computing expert. ACM Winner of the Computing Award Scott Aaronson Announced through a blog post. I will be leaving the University of Texas at Austin this week (UT Austin) a year. And join an artificial intelligence research company OpenAI."))
[{'name': [{'end': 32,
'probability': 0.9999316942434575,
'start': 18,
'text': 'Scott Aaronson'}],
'place name': [{'end': 54,
'probability': 0.976469583224933,
'start': 51,
'text': 'Austin'}],
'time': [{'end': 69,
'probability': 0.9782005099896942,
'start': 67,
'text': 'a year'},
{'end': 2,
'probability': 0.9995077236474508,
'start': 0,
'text': 'recently'},
{'end': 43,
'probability': 0.9999382505043286,
'start': 41,
'text': 'This week'}],
'organization': [{'end': 66,
'probability': 0.46570937436359827,
'start': 57,
'text': 'UT Austin'},
{'end': 56,
'probability': 0.9686587700987381,
'start': 45,
'text': 'University of Texas at Austin '},
{'end': 13,
'probability': 0.7166219551892539,
'start': 10,
'text': 'ACM'},
{'end': 87,
'probability': 0.999835617128781,
'start': 81,
'text': 'OpenAI'}]}]
Model deployment
After obtaining the fine-tuned model, you can deploy it to a HyperAI server to provide a real-time model inference service.
For more information about model deployment, see "Introduction to Model Deployment" and "Model Deployment for Chinese Named Entity Recognition Based on Transfer Learning".
Writing the serving service
Write the predictor.py file:
- Import dependencies: in addition to the libraries needed by the business logic, the extra dependency HyperAI-serving is required.
import HyperAI_serving as serv
from paddlenlp import Taskflow
- Post-processing (optional): process the results returned by the model as needed. For a clearer presentation, this tutorial reshapes the named entity recognition results with the format() and add_o() functions (a sketch of what these helpers might look like is shown below).
- Predictor class: it does not need to inherit from any other class, but it must provide at least the __init__ and predict interfaces.
  - In __init__, define the entity extraction schema and load the model with Taskflow.
  - In predict, run the prediction and return the post-processed result.
class Predictor:
    def __init__(self):
        # Define the extraction schema and load the fine-tuned model weights
        self.schema = ['place name', 'name', 'organization', 'time', 'product', 'Price', 'weather']
        self.ie = Taskflow("information_extraction", schema=self.schema, task_path='./checkpoint/model_best')

    def predict(self, json):
        text = json["input"]
        uie = self.ie(text)[0]
        # format() is the post-processing helper defined in predictor.py
        result = format(text, uie)
        return result
- Main function: start the service.
if __name__ == '__main__':
serv.run(Predictor)
A pre-written predictor.py is already available in the root directory of this tutorial and can be used directly.
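For reference, a minimal sketch of what the format() and add_o() post-processing could look like is shown below. This is not the exact implementation shipped in predictor.py, and the entity_group label mapping is an assumption inferred from the sample output further down.
# Hypothetical post-processing helpers; the real ones live in predictor.py.
LABEL_MAP = {
    'place name': 'LOC', 'name': 'PER', 'organization': 'ORG',
    'time': 'TIME', 'product': 'PROD', 'Price': 'PRICE', 'weather': 'WEATHER',
}

def format(text, uie):
    """Flatten UIE output into a list of entity records sorted by position."""
    entities = []
    for label, spans in uie.items():
        for span in spans:
            entities.append({'entity_group': LABEL_MAP.get(label, label),
                             'score': span['probability'],
                             'start': span['start'], 'end': span['end'],
                             'word': span['text']})
    entities.sort(key=lambda e: (e['start'], e['end']))
    return add_o(entities, text)

def add_o(entities, text):
    """Insert 'O' segments for the text between recognized entities."""
    result, cursor = [], 0
    for ent in entities:
        if ent['start'] > cursor:
            result.append({'entity_group': 'O', 'score': None,
                           'start': cursor, 'end': ent['start'],
                           'word': text[cursor:ent['start']]})
        result.append(ent)
        cursor = max(cursor, ent['end'])
    if cursor < len(text):
        result.append({'entity_group': 'O', 'score': None,
                       'start': cursor, 'end': len(text),
                       'word': text[cursor:]})
    return result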
Testing in Jupyter
Run OPENBAYES_JOB_URL= python predictor.py in the terminal. Once the local test service has started successfully, run the following code in the Notebook to test it.
import requests
text = {
"input": "recently. Quantum computing expert. ACM Winner of the Computing Award Scott Aaronson Announced through a blog post. I will be leaving the University of Texas at Austin this week (UT Austin) a year. And join an artificial intelligence research company OpenAI."
}
result = requests.post('http://localhost:25252', json=text)
result.json()
[{'entity_group': 'TIME',
'score': 0.9995077236474508,
'start': 0,
'end': 2,
'word': 'recently'},
{'entity_group': 'O',
'score': None,
'start': 2,
'end': 10,
  'word': ', quantum computing expert and '},
{'entity_group': 'ORG',
'score': 0.7166219551892539,
'start': 10,
'end': 13,
'word': 'ACM'},
 {'entity_group': 'O', 'score': None, 'start': 13, 'end': 18, 'word': 'Prize in Computing winner'},
{'entity_group': 'PER',
'score': 0.9999316942434575,
'start': 18,
'end': 32,
'word': 'Scott Aaronson'},
{'entity_group': 'O',
'score': None,
'start': 32,
'end': 41,
  'word': 'announced in a blog post that'},
{'entity_group': 'TIME',
'score': 0.9999382505043286,
'start': 41,
'end': 43,
'word': 'This week'},
 {'entity_group': 'O', 'score': None, 'start': 43, 'end': 45, 'word': 'he will leave the'},
{'entity_group': 'ORG',
'score': 0.9686587700987381,
'start': 45,
'end': 56,
'word': 'University of Texas at Austin '},
{'entity_group': 'LOC',
'score': 0.976469583224933,
'start': 51,
'end': 54,
'word': 'Austin'},
{'entity_group': 'O', 'score': None, 'start': 56, 'end': 57, 'word': '('},
{'entity_group': 'ORG',
'score': 0.46570937436359827,
'start': 57,
'end': 66,
'word': 'UT Austin'},
{'entity_group': 'O', 'score': None, 'start': 66, 'end': 67, 'word': ')'},
{'entity_group': 'TIME',
'score': 0.9782005099896942,
'start': 67,
'end': 69,
'word': 'a year'},
{'entity_group': 'O',
'score': None,
'start': 69,
'end': 81,
  'word': 'and join the artificial intelligence research company'},
{'entity_group': 'ORG',
'score': 0.999835617128781,
'start': 81,
'end': 87,
'word': 'OpenAI'},
{'entity_group': 'O', 'score': None, 'start': 87, 'end': 88, 'word': '.'}]
Deploy
After the test succeeds, stop this compute container and wait for data synchronization to complete.
In 'Compute Container' - 'Model Deployment', click 'Create New Deployment', choose the same image used during development, bind this compute container, and click 'Deploy'. You can then test the service online.
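Once the deployment is up, it can be called in the same way as the local test, only with the deployment's serving URL instead of localhost. The URL below is a placeholder; use the address shown on your deployment page.
import requests

SERVING_URL = "https://<your-deployment-address>"  # placeholder: copy from the deployment page

text = {"input": "Scott Aaronson will leave UT Austin this week and join OpenAI."}
result = requests.post(SERVING_URL, json=text)
print(result.json())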