This tutorial builds on the UIE named entity recognition tutorial. It further integrates the Machine Learning Backend of Label Studio to implement interactive pre-annotation and model training.

Environment preparation

  1. In HyperAI, start a "model training" container. Select the paddlepaddle-2.3 environment and a vgpu or other GPU resource.

  2. Open a Terminal window in Jupyter, then run the command HyperAI-label-studio to start label-studio.

    Open the link shown in the red box in your browser, then register an account and log in.

  3. Open another Terminal window and run the following commands to install label_studio_ml:

    pip install label_studio_ml
    pip uninstall attr

Writing the Machine Learning Backend

For the complete Machine Learning Backend, see the my_ml_backend.py file. For more information on developing a custom machine learning backend, refer to Write your own ML backend.

Simply put, my_ml_backend.py mainly contains a class that inherits from LabelStudioMLBase. Its content can be divided into the following three main parts:

  1. The __init__ method. It loads the model and initializes the basic configuration.
  2. The predict method. It generates new predictions for the data to be annotated. Its key parameter, tasks, is the raw data passed in by Label Studio.
  3. The fit method. It trains the model and is called when the Train button on the page is clicked (the exact location is described later in this tutorial). Its key parameter, annotations, is the annotated data passed in by Label Studio.

The __init__ initialization method

Define and initialize the required variables in the __init__ method. The LabelStudioMLBase class provides the following special variables:

  • self.label_config: the original label configuration.
  • self.parsed_label_config: the project's Label Studio label configuration in structured form.
  • self.train_output: the results of previous model training runs. It is the same as the output of the fit() method described in the training section.

In the example used in this tutorial, the label configuration is:

<View>
  <Labels name="label" toName="text">
    <Label value="place name" background="#FFA39E"/>
    <Label value="name" background="#D4380D"/>
    <Label value="organization" background="#FFC069"/>
    <Label value="time" background="#AD8B00"/>
    <Label value="product" background="#D3F261"/>
    <Label value="Price" background="#389E0D"/>
    <Label value="weather" background="#5CDBD3"/>
  </Labels>
  <Text name="text" value="$text"/>
</View>

The corresponding parsed_label_config is shown below:

{
    'label': {
        'type': 'Labels',
        'to_name': ['text'],
        'inputs': [{
            'type': 'Text',
            'value': 'text'
        }],
        'labels': ['place name', 'name', 'organization', 'time', 'product', 'Price', 'weather'],
        'labels_attrs': {
            'place name': {
                'value': 'place name',
                'background': '#FFA39E'
            },
            'name': {
                'value': 'name',
                'background': '#D4380D'
            },
            'organization': {
                'value': 'organization',
                'background': '#FFC069'
            },
            'time': {
                'value': 'time',
                'background': '#AD8B00'
            },
            'product': {
                'value': 'product',
                'background': '#D3F261'
            },
            'Price': {
                'value': 'Price',
                'background': '#389E0D'
            },
            'weather': {
                'value': 'weather',
                'background': '#5CDBD3'
            }
        }
    }
}

Extract whatever information is needed from the self.parsed_label_config variable, and load the model used for pre-annotation through the Taskflow API of PaddleNLP.

def __init__(self, **kwargs):
    # don't forget to initialize base class...
    super(MyModel, self).__init__(**kwargs)

    # print("parsed_label_config:", self.parsed_label_config)
    # take the first control tag of the label configuration
    self.from_name, self.info = list(self.parsed_label_config.items())[0]

    assert self.info['type'] == 'Labels'
    assert self.info['inputs'][0]['type'] == 'Text'

    self.to_name = self.info['to_name'][0]
    self.value = self.info['inputs'][0]['value']
    self.labels = list(self.info['labels'])
    # init uie model (Taskflow comes from: from paddlenlp import Taskflow)
    self.model = Taskflow("information_extraction", schema=self.labels, task_path='./checkpoint/model_best')
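
The extraction step can be tried on its own, without Label Studio or PaddleNLP installed. The following is a minimal sketch and not part of my_ml_backend.py: the helper name extract_ner_config is hypothetical, and sample_config is a shortened copy of the parsed_label_config shown above.

```python
def extract_ner_config(parsed_label_config):
    """Return (from_name, to_name, value, labels) for a single Labels control.

    Hypothetical helper that mirrors the attribute assignments in __init__.
    """
    # The first (and here the only) control tag of the label configuration
    from_name, info = list(parsed_label_config.items())[0]
    if info['type'] != 'Labels':
        raise ValueError("expected a Labels control tag")
    if info['inputs'][0]['type'] != 'Text':
        raise ValueError("expected a Text object tag")
    to_name = info['to_name'][0]        # name of the object tag being labeled
    value = info['inputs'][0]['value']  # key of the text inside task['data']
    labels = list(info['labels'])       # used as the UIE schema
    return from_name, to_name, value, labels


# Shortened copy of the parsed configuration from this tutorial
sample_config = {
    'label': {
        'type': 'Labels',
        'to_name': ['text'],
        'inputs': [{'type': 'Text', 'value': 'text'}],
        'labels': ['place name', 'name', 'organization'],
    }
}

from_name, to_name, value, labels = extract_ner_config(sample_config)
print(from_name, to_name, value, labels)
```

Raising ValueError instead of using assert is a design choice of this sketch only; it keeps the checks active even when Python runs with optimizations enabled.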

The predict prediction method

Override the predict(tasks, **kwargs) method. The predict() method accepts Label Studio tasks in JSON format and returns predictions in the format that Label Studio accepts. In addition, the predictions can carry custom prediction scores, which can be used for an active learning loop.

The tasks parameter contains detailed information about the tasks to be pre-annotated. A single task has the following format:

{
    'id': 16,
    'data': {
        'text': 'Xinhua News Agency, Dublin, June 28 (Reporter Zhang Qi) The results of the Ireland competition of the second "Chinese Bridge" World Primary School Chinese Show were announced recently. Ella·Gorman, a fifth-grade primary school student from Dublin City, won the first prize.'
    },
    'meta': {},
    'created_at': '2022-07-12T07:05:06.793411Z',
    'updated_at': '2022-07-12T07:05:06.793424Z',
    'is_labeled': False,
    'overlap': 1,
    'inner_id': 6,
    'total_annotations': 0,
    'cancelled_annotations': 0,
    'total_predictions': 0,
    'project': 2,
    'updated_by': None,
    'file_upload': 2,
    'annotations': [],
    'predictions': []
}

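The exact format can be viewed in Label Studio by clicking "show task source" in the data list.

To make predictions with Taskflow, extract the original text from the ['data']['text'] field. The UIE prediction result is returned in the following format. The start and end values are character offsets into the task text; in this example they refer to the original Chinese text, so they do not line up with the translated strings shown here.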

{
    'place name': [{
        'text': 'Ireland',
        'start': 34,
        'end': 37,
        'probability': 0.9999107139090313
    }, {
        'text': 'Dublin City',
        'start': 50,
        'end': 54,
        'probability': 0.9997840536235998
    }, {
        'text': 'Dublin',
        'start': 3,
        'end': 6,
        'probability': 0.9999684097596173
    }],
    'name': [{
        'text': 'Ella·Gorman',
        'start': 62,
        'end': 68,
        'probability': 0.9999879598978225
    }, {
        'text': 'Zhang Qi',
        'start': 15,
        'end': 17,
        'probability': 0.9999905824882092
    }],
    'organization': [{
        'text': 'Xinhua News Agency',
        'start': 0,
        'end': 3,
        'probability': 0.999975681447097
    }],
    'time': [{
        'text': 'June 28',
        'start': 6,
        'end': 11,
        'probability': 0.9997071721989244
    }, {
        'text': 'recently',
        'start': 43,
        'end': 45,
        'probability': 0.9999804497706464
    }]
}

Extract the corresponding fields from the UIE prediction result to build the pre-annotation format that Label Studio accepts. For a pre-annotation example specific to named entity recognition tasks, see Import span pre-annotations for text.

For pre-annotation examples for other types of tasks, see Specific examples for pre-annotations.

def predict(self, tasks, **kwargs):
    from_name = self.from_name
    to_name = self.to_name
    model = self.model

    predictions = []
    # loop over every task
    for task in tasks:
        # print("predict task:", task)
        text = task['data'][self.value]
        # the model returns a list; take the result for this single text
        uie = model(text)[0]
        # print("uie:", uie)

        result = []
        scores = []
        for key in uie:
            for item in uie[key]:
                result.append({
                    'from_name': from_name,
                    'to_name': to_name,
                    'type': 'labels',
                    'value': {
                        'start': item['start'],
                        'end': item['end'],
                        'score': item['probability'],
                        'text': item['text'],
                        'labels': [key]
                    }
                })
                scores.append(item['probability'])
        result = sorted(result, key=lambda k: k["value"]["start"])
        # np is numpy (import numpy as np at the top of the file)
        mean_score = np.mean(scores) if len(scores) > 0 else 0

        predictions.append({
            'result': result,
            # optionally you can include prediction scores that you can use to sort the tasks and do active learning
            'score': float(mean_score),
            'model_version': 'uie-ner'
        })
    return predictions
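
The conversion inside predict can also be exercised without a model. The following is a minimal sketch under stated assumptions: the function name uie_to_prediction is hypothetical, statistics.mean from the standard library stands in for np.mean, and the sample probabilities are made up for illustration (the offsets are taken from the UIE result above).

```python
from statistics import mean


def uie_to_prediction(uie, from_name, to_name, model_version='uie-ner'):
    """Convert one UIE result dict into one Label Studio prediction dict."""
    result = []
    scores = []
    for label, spans in uie.items():
        for span in spans:
            result.append({
                'from_name': from_name,
                'to_name': to_name,
                'type': 'labels',
                'value': {
                    'start': span['start'],
                    'end': span['end'],
                    'score': span['probability'],
                    'text': span['text'],
                    'labels': [label],
                },
            })
            scores.append(span['probability'])
    # Order the regions by their position in the text
    result.sort(key=lambda region: region['value']['start'])
    return {
        'result': result,
        # Mean span probability as a task-level score; 0.0 when nothing was found
        'score': float(mean(scores)) if scores else 0.0,
        'model_version': model_version,
    }


# Illustrative input: probabilities are invented for this sketch
sample_uie = {
    'place name': [{'text': 'Dublin', 'start': 3, 'end': 6, 'probability': 1.0}],
    'organization': [{'text': 'Xinhua News Agency', 'start': 0, 'end': 3, 'probability': 0.5}],
}
prediction = uie_to_prediction(sample_uie, from_name='label', to_name='text')
print(prediction['score'], [r['value']['labels'] for r in prediction['result']])
```

Returning 0.0 for an empty result keeps the task-level score well defined for texts in which the model finds no entities.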

The fit training method

Update the model based on new annotations.

Override the fit() method. The fit() method accepts Label Studio annotations in JSON format and returns an arbitrary JSON dictionary in which model-related information can be stored.

def fit(self, annotations, workdir=None, **kwargs):
    """ This is where training happens: train your model given list of annotations,
    then returns dict with created links and resources
    """
    # print("annotations:", annotations)
    # convert() is not shown in this section; it presumably turns the
    # annotations into doccano-format records (see my_ml_backend.py).
    # This method also needs `import json` and `import os` at the top of the file.
    dataset = convert(annotations)

    # write one JSON record per line
    with open("./doccano_ext.jsonl", "w", encoding="utf-8") as outfile:
        for item in dataset:
            outline = json.dumps(item, ensure_ascii=False)
            outfile.write(outline + "\n")

    # split the records into training and dev sets
    os.system('python doccano.py \
        --doccano_file ./doccano_ext.jsonl \
        --task_type "ext" \
        --save_dir ./data \
        --splits 0.5 0.5 0')

    # fine-tune, starting from the current best checkpoint
    os.system('python finetune.py \
        --train_path "./data/train.txt" \
        --dev_path "./data/dev.txt" \
        --save_dir "./checkpoint" \
        --learning_rate 1e-6 \
        --batch_size 4 \
        --max_seq_len 512 \
        --num_epochs 20 \
        --model "uie-base" \
        --init_from_ckpt "./checkpoint/model_best/model_state.pdparams" \
        --seed 1000 \
        --logging_steps 10 \
        --valid_steps 100 \
        --device "gpu"')

    return {
        'path': workdir
    }
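
The convert() helper called in fit is not shown in this section. The following is a minimal sketch of what such a conversion could look like, not the author's implementation. It rests on two assumptions: each element of annotations is a Label Studio task dict with 'data' and 'annotations' keys, and doccano.py reads doccano-style records with text, entities, and relations fields. The name convert_to_doccano and the sample task are hypothetical; check my_ml_backend.py for the real helper.

```python
def convert_to_doccano(annotated_tasks, text_key='text'):
    """Hypothetical stand-in for convert(): Label Studio tasks -> doccano-style records."""
    records = []
    entity_id = 0
    for record_id, task in enumerate(annotated_tasks):
        entities = []
        # Assumption: the first annotation of each task is the one to train on
        for region in task['annotations'][0]['result']:
            if region.get('type') != 'labels':
                continue  # skip anything that is not a labeled text span
            value = region['value']
            entities.append({
                'id': entity_id,
                'start_offset': value['start'],
                'end_offset': value['end'],
                'label': value['labels'][0],
            })
            entity_id += 1
        records.append({
            'id': record_id,
            'text': task['data'][text_key],
            'entities': entities,
            'relations': [],  # plain NER: no relations between entities
        })
    return records


# Invented sample task, for illustration only
sample_tasks = [{
    'data': {'text': 'Zhang Qi reports from Dublin'},
    'annotations': [{
        'result': [
            {'type': 'labels',
             'value': {'start': 0, 'end': 8, 'text': 'Zhang Qi', 'labels': ['name']}},
            {'type': 'labels',
             'value': {'start': 22, 'end': 28, 'text': 'Dublin', 'labels': ['place name']}},
        ]
    }],
}]

doccano_records = convert_to_doccano(sample_tasks)
print(doccano_records)
```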

Machine Learning integration

Start the Machine Learning Backend

Execute the following commands sequentially in the terminal:

# Initialize custom machine learning backend
label-studio-ml init <my_ml_backend> --script <my_ml_backend.py>

# Activate the machine learning backend service
label-studio-ml start <my_ml_backend>

After a successful startup, the URL of the ML backend is shown in the terminal.

**Note:** The external access link shown in the red box is different for each HyperAI compute container, so the link from this tutorial will not work if used directly. Replace it with the link printed in your own terminal. You can also replace the IP address in it with localhost.

Add the ML Backend to Label Studio

After starting the custom machine learning backend, you can add it to a Label Studio project.

The specific steps are as follows:

  1. Click Settings - Machine Learning - Add Model

  2. Fill in the title, the URL of the ML backend, a description (optional), and so on

  3. Select Use for interactive preannotations to turn on the interactive pre-annotation feature (optional)

  4. Click Validate and Save

If an error occurs, see Machine learning troubleshooting. Besides adding the ML backend through the Label Studio UI, you can also add the ML backend using the API.
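
For the API route, the request could be prepared as in the sketch below. It is built on assumptions that should be checked against the Label Studio API reference for your version: that the endpoint is POST /api/ml, that it accepts url, project, and title fields, and that authentication uses an "Authorization: Token ..." header. The function name, the host and port values, and the token are placeholders. The sketch only builds the request and does not send it.

```python
import json
from urllib.request import Request


def build_add_backend_request(label_studio_url, api_token, project_id, backend_url, title):
    """Build (but do not send) a request that registers an ML backend.

    Assumed endpoint and payload; verify against your Label Studio version.
    """
    payload = {'url': backend_url, 'project': project_id, 'title': title}
    return Request(
        url=label_studio_url.rstrip('/') + '/api/ml',
        data=json.dumps(payload).encode('utf-8'),
        headers={
            'Authorization': 'Token ' + api_token,
            'Content-Type': 'application/json',
        },
        method='POST',
    )


# Placeholder values: replace them with the links printed in your terminals
request = build_add_backend_request(
    label_studio_url='http://localhost:8080',
    api_token='<your-api-token>',
    project_id=2,
    backend_url='http://localhost:9090',
    title='uie-ner',
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request)
```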

Get interactive pre-annotations

To use the interactive pre-annotation feature, the Use for interactive preannotations option must be turned on when adding the ML backend. If it was not turned on, click Edit to change the setting. Then click any task: Label Studio silently runs the ML backend that was just added and generates new annotations.

View the pre-annotated data and, if necessary, modify the annotations.

In this example, "Economic Development Zone" and "Local small hail" were not recognized in the pre-annotation results. Once the modified or pre-annotated results meet expectations, click Submit to submit the annotation results.

Train the model

After annotating at least one task, you can start training the model. Click Settings - Machine Learning - Start Training to start training.

Then return to the window in which label-studio-ml-backend was started; it shows that the training process has begun. You can also train the model using the API or trigger training using webhooks.

Summary

  • The Machine Learning Backend provided by Label Studio is a flexible framework for assisting manual annotation, and it can indeed speed up the annotation of NLP data
  • The Enterprise version of Label Studio provides an Active Learning process. Judging from its description, however, this process is not perfect, especially the fit part, because Label Studio underestimates the time that "Train" takes. Automatically training after every annotation may therefore not go so smoothly
  • We did not use the "Auto-Annotation" feature provided by Label Studio, because it has a problem with duplicate annotations
  • Since Label Studio provides an API, there is a lot of room to experiment; combining it with webhooks and similar mechanisms may make the annotation and training process more efficient