HyperAI Documentation

This tutorial builds on the UIE Named Entity Recognition tutorial. It further integrates the Label Studio Machine Learning Backend to implement interactive pre-annotation and model training.

Environment preparation

In HyperAI, start a "model training" container. Select paddlepaddle-2.3 as the environment and vgpu or another GPU container as the resource.
In Jupyter, open a Terminal window, then run the command HyperAI-label-studio to start label-studio.

Open the link in the red box in your browser, then register an account and log in.
Open another Terminal window and run the following commands to install label_studio_ml:
```
pip install label_studio_ml
pip uninstall attr
```

Writing the Machine Learning Backend

For the complete Machine Learning Backend, see the my_ml_backend.py file. For more information on developing a custom machine learning backend, refer to Write your own ML backend.

Simply put, my_ml_backend.py mainly contains a class that inherits from LabelStudioMLBase. Its content can be divided into the following three main parts:

__init__ method. Loads the model and initializes the basic configuration.
predict method. Generates new predictions for the data being annotated. Its key parameter tasks is the raw data passed in by Label Studio.
fit method. Used for model training. It is called when the Train button is clicked on the page (the exact location is described later in this tutorial). Its key parameter annotations is the annotated data passed in by Label Studio.

`__init__` initialization method

In the __init__ method, define and initialize the variables you need. The LabelStudioMLBase class provides the following special variables:

self.label_config: The original label configuration.
self.parsed_label_config: The project's Label Studio label configuration in structured form.
self.train_output: The results of the previous model training run. Its content is the same as the output of the fit() method defined in the training section.

In the example used in this tutorial, the label configuration is:

```
<View>
  <Labels name="label" toName="text">
    <Label value="place name" background="#FFA39E"/>
    <Label value="name" background="#D4380D"/>
    <Label value="organization" background="#FFC069"/>
    <Label value="time" background="#AD8B00"/>
    <Label value="product" background="#D3F261"/>
    <Label value="Price" background="#389E0D"/>
    <Label value="weather" background="#5CDBD3"/>
  </Labels>
  <Text name="text" value="$text"/>
</View>
```

The corresponding parsed_label_config is shown below:

```
{
    'label': {
        'type': 'Labels',
        'to_name': ['text'],
        'inputs': [{
            'type': 'Text',
            'value': 'text'
        }],
        'labels': ['place name', 'name', 'organization', 'time', 'product', 'Price', 'weather'],
        'labels_attrs': {
            'place name': {
                'value': 'place name',
                'background': '#FFA39E'
            },
            'name': {
                'value': 'name',
                'background': '#D4380D'
            },
            'organization': {
                'value': 'organization',
                'background': '#FFC069'
            },
            'time': {
                'value': 'time',
                'background': '#AD8B00'
            },
            'product': {
                'value': 'product',
                'background': '#D3F261'
            },
            'Price': {
                'value': 'Price',
                'background': '#389E0D'
            },
            'weather': {
                'value': 'weather',
                'background': '#5CDBD3'
            }
        }
    }
}
```

As needed, extract the required information from the self.parsed_label_config variable, and load the model used for pre-annotation through PaddleNLP's Taskflow.

```
# Requires at module level: from paddlenlp import Taskflow
def __init__(self, **kwargs):
    # don't forget to initialize the base class
    super(MyModel, self).__init__(**kwargs)

    # print("parsed_label_config:", self.parsed_label_config)
    # Take the first control tag: its name and its parsed description
    self.from_name, self.info = list(self.parsed_label_config.items())[0]

    # This backend only supports a <Labels> tag attached to a <Text> object
    assert self.info['type'] == 'Labels'
    assert self.info['inputs'][0]['type'] == 'Text'

    self.to_name = self.info['to_name'][0]
    self.value = self.info['inputs'][0]['value']
    self.labels = list(self.info['labels'])
    # init the UIE model, using the label names as the extraction schema
    self.model = Taskflow("information_extraction", schema=self.labels, task_path='./checkpoint/model_best')
```
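To check the extraction logic without loading a model, the same steps can be run on the parsed configuration shown earlier. This is a minimal sketch: the helper name `parse_ner_config` is hypothetical and not part of my_ml_backend.py, and the sample config is trimmed to the fields the method actually reads.

```python
def parse_ner_config(parsed_label_config):
    """Return (from_name, to_name, value, labels) for a single <Labels> control tag."""
    # The first (and here only) control tag in the config
    from_name, info = list(parsed_label_config.items())[0]
    if info['type'] != 'Labels' or info['inputs'][0]['type'] != 'Text':
        raise ValueError('expected a <Labels> tag attached to a <Text> object')
    to_name = info['to_name'][0]
    value = info['inputs'][0]['value']
    return from_name, to_name, value, list(info['labels'])


# Trimmed copy of the parsed_label_config shown above (labels_attrs omitted)
example_config = {
    'label': {
        'type': 'Labels',
        'to_name': ['text'],
        'inputs': [{'type': 'Text', 'value': 'text'}],
        'labels': ['place name', 'name', 'organization', 'time',
                   'product', 'Price', 'weather'],
    }
}

from_name, to_name, value, labels = parse_ner_config(example_config)
print(from_name, to_name, value, len(labels))  # label text text 7
```

Here `value` is the key under which each task stores its text, which is why predict reads `task['data'][self.value]`.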

`predict` prediction method

Write code to override the predict(tasks, **kwargs) method. predict() accepts Label Studio tasks in JSON format and returns predictions in the format Label Studio accepts. It can also include custom prediction scores that can be used in an active learning loop.

The tasks parameter contains detailed information about the tasks to be pre-annotated. The format of a single task is as follows:

```
{
    'id': 16,
    'data': {
        'text': 'Xinhua News Agency Dublin 6 month 28 (Xinhua) (Reporter Zhang Qi)The Second Session“Chinese Bridge”The results of the World Primary School Chinese Show Ireland competition have been announced recently. Ella, a fifth grade elementary school student from Dublin City·Gorman won the first prize.'
    },
    'meta': {},
    'created_at': '2022-07-12T07:05:06.793411Z',
    'updated_at': '2022-07-12T07:05:06.793424Z',
    'is_labeled': False,
    'overlap': 1,
    'inner_id': 6,
    'total_annotations': 0,
    'cancelled_annotations': 0,
    'total_predictions': 0,
    'project': 2,
    'updated_by': None,
    'file_upload': 2,
    'annotations': [],
    'predictions': []
}
```

The exact format can be viewed in Label Studio by clicking "show task source" in the data list.

To make predictions with Taskflow, extract the original text from the ['data']['text'] field. The UIE prediction result it returns has the following format:

```
{
    'place name': [{
        'text': 'Ireland',
        'start': 34,
        'end': 37,
        'probability': 0.9999107139090313
    }, {
        'text': 'Dublin City',
        'start': 50,
        'end': 54,
        'probability': 0.9997840536235998
    }, {
        'text': 'Dublin',
        'start': 3,
        'end': 6,
        'probability': 0.9999684097596173
    }],
    'name': [{
        'text': 'Ella·Gorman',
        'start': 62,
        'end': 68,
        'probability': 0.9999879598978225
    }, {
        'text': 'Zhang Qi',
        'start': 15,
        'end': 17,
        'probability': 0.9999905824882092
    }],
    'organization': [{
        'text': 'Xinhua News Agency',
        'start': 0,
        'end': 3,
        'probability': 0.999975681447097
    }],
    'time': [{
        'text': '6 month 28 day',
        'start': 6,
        'end': 11,
        'probability': 0.9997071721989244
    }, {
        'text': 'Recently',
        'start': 43,
        'end': 45,
        'probability': 0.9999804497706464
    }]
}
```

Extract the corresponding fields from the UIE prediction result to build the pre-annotation format that Label Studio accepts. For a pre-annotation example specific to named entity recognition tasks, refer to Import span pre-annotations for text.

For pre-annotation examples for other task types, refer to Specific examples for pre-annotations.

```
# Requires at module level: import numpy as np
def predict(self, tasks, **kwargs):
    from_name = self.from_name
    to_name = self.to_name
    model = self.model

    predictions = []
    # loop over every task
    for task in tasks:
        # print("predict task:", task)
        text = task['data'][self.value]
        # the model returns a list; take the result for this single text
        uie = model(text)[0]
        # print("uie:", uie)

        result = []
        scores = []
        # key is the label name, item is one extracted span
        for key in uie:
            for item in uie[key]:
                result.append({
                    'from_name': from_name,
                    'to_name': to_name,
                    'type': 'labels',
                    'value': {
                        'start': item['start'],
                        'end': item['end'],
                        'score': item['probability'],
                        'text': item['text'],
                        'labels': [key]
                    }
                })
                scores.append(item['probability'])
        # order spans by their position in the text
        result = sorted(result, key=lambda k: k["value"]["start"])
        # task-level score: mean span probability, 0 when nothing was found
        mean_score = np.mean(scores) if len(scores) > 0 else 0

        predictions.append({
            'result': result,
            # optionally you can include prediction scores that you can use to sort the tasks and do active learning
            'score': float(mean_score),
            'model_version': 'uie-ner'
        })
    return predictions
```
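The task-level `score` is what makes an active learning loop possible: tasks whose predictions have the lowest mean probability are the ones the model is least sure about, so they are good candidates to annotate first. A minimal sketch of that ordering, assuming the caller has collected (task id, prediction) pairs; the helper name `rank_by_uncertainty` and the sample scores are hypothetical.

```python
def rank_by_uncertainty(scored_tasks):
    """Order (task_id, prediction) pairs from least to most confident."""
    return sorted(scored_tasks, key=lambda pair: pair[1]['score'])


# Hypothetical predictions in the shape returned by predict()
scored = [
    (16, {'result': [], 'score': 0.99, 'model_version': 'uie-ner'}),
    (17, {'result': [], 'score': 0.42, 'model_version': 'uie-ner'}),
    (18, {'result': [], 'score': 0.0, 'model_version': 'uie-ner'}),
]

ranked_ids = [task_id for task_id, _ in rank_by_uncertainty(scored)]
print(ranked_ids)  # [18, 17, 16]
```

Note that predict assigns a score of 0 when no entity is found, so such tasks sort to the front of the queue.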

`fit` training method

Update the model based on new annotations.

Write code to override the fit() method. fit() accepts Label Studio annotations in JSON format and returns an arbitrary JSON dictionary in which model-related information can be stored.

```
# Requires at module level: import json, import os
# convert() is a helper that is not shown in this excerpt
def fit(self, annotations, workdir=None, **kwargs):
    """ This is where training happens: train your model given list of annotations,
        then returns dict with created links and resources
    """
    # print("annotations:", annotations)
    dataset = convert(annotations)

    # write the converted annotations as one JSON object per line
    with open("./doccano_ext.jsonl", "w", encoding="utf-8") as outfile:
        for item in dataset:
            outline = json.dumps(item, ensure_ascii=False)
            outfile.write(outline + "\n")

    # split the exported data into train and dev sets
    os.system('python doccano.py \
        --doccano_file ./doccano_ext.jsonl \
        --task_type "ext" \
        --save_dir ./data \
        --splits 0.5 0.5 0')

    # fine-tune starting from the current best checkpoint
    os.system('python finetune.py \
        --train_path "./data/train.txt" \
        --dev_path "./data/dev.txt" \
        --save_dir "./checkpoint" \
        --learning_rate 1e-6 \
        --batch_size 4 \
        --max_seq_len 512 \
        --num_epochs 20 \
        --model "uie-base" \
        --init_from_ckpt "./checkpoint/model_best/model_state.pdparams" \
        --seed 1000 \
        --logging_steps 10 \
        --valid_steps 100 \
        --device "gpu"')

    return {
        'path': workdir
    }
```
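The `convert` helper called at the start of fit is not shown in this tutorial. The following is a minimal sketch of what such a helper might look like, not the author's implementation. It assumes each item in annotations is a Label Studio task dictionary with `data` and `annotations` keys, and that the target is the doccano extraction format (`text`, `entities` with `start_offset`, `end_offset` and `label`, plus `relations`). The name `convert_to_doccano` and the sample task are hypothetical.

```python
def convert_to_doccano(annotations, text_key='text'):
    """Turn Label Studio tasks with span annotations into doccano-style records."""
    dataset = []
    for task in annotations:
        task_annotations = task.get('annotations') or []
        if not task_annotations:
            continue  # nothing labeled for this task
        first = task_annotations[0]
        if first.get('was_cancelled'):
            continue  # the annotator skipped this task
        entities = []
        for region in first.get('result', []):
            if region.get('type') != 'labels':
                continue  # only span labels are relevant for NER
            value = region['value']
            entities.append({
                'id': len(entities),
                'start_offset': value['start'],
                'end_offset': value['end'],
                'label': value['labels'][0],
            })
        dataset.append({
            'id': task['id'],
            'text': task['data'][text_key],
            'entities': entities,
            'relations': [],
        })
    return dataset


# Hypothetical annotated task, shaped like the task source shown earlier
sample = [{
    'id': 16,
    'data': {'text': 'Zhang Qi reports from Dublin'},
    'annotations': [{
        'was_cancelled': False,
        'result': [
            {'from_name': 'label', 'to_name': 'text', 'type': 'labels',
             'value': {'start': 0, 'end': 8, 'text': 'Zhang Qi', 'labels': ['name']}},
            {'from_name': 'label', 'to_name': 'text', 'type': 'labels',
             'value': {'start': 22, 'end': 28, 'text': 'Dublin', 'labels': ['place name']}},
        ],
    }],
}]

records = convert_to_doccano(sample)
print(records[0]['entities'])
```

Only the first annotation of each task is used here; a project with several annotators per task would need a rule for choosing or merging them.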

Machine Learning integration

Start the Machine Learning Backend

Run the following commands in the terminal, in order:

```
# Initialize the custom machine learning backend
label-studio-ml init <my_ml_backend> --script <my_ml_backend.py>

# Start the machine learning backend service
label-studio-ml start <my_ml_backend>
```

After a successful start, the ML backend URL is shown in the terminal.

**Note:** The external access link in the red box is different for each HyperAI compute container, so the links shown in this tutorial will not work if used directly. Replace them with the link shown in your own terminal. You can also replace the IP address in the link with localhost.

Add the ML Backend to Label Studio

After starting the custom machine learning backend, you can add it to a Label Studio project.

The steps are as follows:

Click Settings - Machine Learning - Add Model
Fill in the title, the ML backend URL, a description (optional), and so on
Select Use for interactive preannotations to enable interactive pre-annotation (optional)
Click Validate and Save

If an error occurs, see Machine learning troubleshooting. Besides adding the ML backend through the Label Studio UI, you can also add it using the API.
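As a rough illustration of the API route, the sketch below builds the HTTP request without sending it. It assumes the Label Studio endpoint `POST /api/ml` with a JSON body containing `url`, `project` and `title`, and token authentication through an `Authorization: Token ...` header; check these against the API reference for your Label Studio version. The helper name, URLs, token and project id are placeholders.

```python
import json
import urllib.request


def build_add_backend_request(label_studio_url, token, backend_url, project_id, title='uie-ner'):
    """Build (but do not send) a POST request that registers an ML backend."""
    payload = {'url': backend_url, 'project': project_id, 'title': title}
    return urllib.request.Request(
        url=label_studio_url.rstrip('/') + '/api/ml',
        data=json.dumps(payload).encode('utf-8'),
        headers={
            'Content-Type': 'application/json',
            'Authorization': 'Token ' + token,
        },
        method='POST',
    )


# Placeholder values: substitute your own Label Studio URL, access token and project id
request = build_add_backend_request(
    'http://localhost:8080', 'YOUR_TOKEN', 'http://localhost:9090', project_id=2)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the URL and payload before touching a live project.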

Get interactive pre-annotations

To use interactive pre-annotation, the Use for interactive preannotations option must be enabled when adding the ML backend. If it was not enabled, click Edit to change the setting. Then click any task, and Label Studio will silently run the ML backend you just configured and generate new annotations.

Review the pre-annotated data and modify the annotations if necessary.

In this example, "Economic Development Zone" and "Local small hail" were not recognized in the pre-annotation results. Once the modified or pre-annotated results meet expectations, click Submit to submit the annotation.

Train the model

After annotating at least one task, you can start training the model. Click Settings - Machine Learning - Start Training to begin.

Then return to the window where label-studio-ml-backend was started to see that the training process has begun. In addition, you can also train the model using the API or trigger training with webhooks.
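A minimal sketch of the API route for training, again building the request without sending it. It assumes the Label Studio endpoint `POST /api/ml/{id}/train`, where the id is the one assigned to the ML backend when it was added; verify the path against the API reference for your version. The helper name, URL, token and backend id are placeholders.

```python
import urllib.request


def build_train_request(label_studio_url, token, backend_id):
    """Build (but do not send) a POST request that asks Label Studio to start training."""
    url = '{}/api/ml/{}/train'.format(label_studio_url.rstrip('/'), backend_id)
    return urllib.request.Request(
        url=url,
        data=b'{}',  # empty JSON body
        headers={
            'Content-Type': 'application/json',
            'Authorization': 'Token ' + token,
        },
        method='POST',
    )


# Placeholder values: replace with your own URL, token and ML backend id
train_request = build_train_request('http://localhost:8080', 'YOUR_TOKEN', backend_id=1)
print(train_request.get_method(), train_request.full_url)
# To actually send it: urllib.request.urlopen(train_request)
```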

Summary

The Machine Learning Backend provided by Label Studio is a flexible framework for assisting manual annotation, and it can indeed speed up the annotation of NLP data
The enterprise version of Label Studio provides an Active Learning workflow. Judging from its description, however, this workflow is not perfect, especially the fit part: Label Studio underestimates the time that "Train" takes, so automatically training after every annotation may not run so smoothly
We did not use the "Auto-Annotation" feature provided by Label Studio, because it has a problem with duplicate annotations
Since Label Studio provides an API, there is in fact a lot to explore; combined with webhooks and similar features, it may make the annotation and training process more efficient
