How to create a good dataset
Preface
When facing a deep learning problem or a scenario that needs to be solved, the most common obstacle we run into is "no data": the data is messy, the data is not labeled, or the annotation quality is poor. These issues block us before we can even start solving the problem. In this section, we introduce how to create a good dataset, and we hope it will be of help to everyone.
Before starting
Before starting to create a dataset, we need to answer these questions first:
- What kind of problems do we need to solve in our scenario?
- What kind of data is needed to solve such a problem?
- Is there any publicly available dataset similar to our scenario?
- Within our own organization, how much data can we collect?
- How much does it cost to annotate a single piece of data?
Production steps
Determine the task
Before solving the problem and creating the dataset, the first step is to determine what our task actually is. Without a clear understanding of the task we need to solve, all of our later work will be wasted.
When determining the task, we first need to clarify our scenario: what kind of input it starts from and what kind of output it should produce. With clearly defined inputs and outputs, we can roughly tell what problem we are facing.
After clarifying the problem, break it down into one or more algorithmic problems. Assuming these algorithm models were already available, check whether the overall process would solve the problem; if so, we can start planning how to train the corresponding algorithms and what datasets are required to build them.
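For example, writing the input and output down as a typed interface makes the task definition concrete. The sketch below assumes a hypothetical street-sign OCR task; the names and types are illustrative only:

```python
from dataclasses import dataclass

# Hypothetical example task: street-sign OCR.
# Input: a photo taken in the field; output: the recognized text.
@dataclass
class TaskInput:
    image_path: str

@dataclass
class TaskOutput:
    text: str          # recognized sign text
    confidence: float  # 0.0 to 1.0

def recognize_sign(inp: TaskInput) -> TaskOutput:
    """Stand-in for the algorithm model we assume already exists."""
    raise NotImplementedError
```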
Design data distribution
The scenarios covered by the dataset are set according to the objectives of the task, targeting the recognition targets and situations that may arise during the task. I usually consider distributions along several dimensions: language (Chinese and English), color (black-and-white and color), weather, and content.
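One way to keep such a design honest is to enumerate the distribution dimensions as a grid and count collected samples per bucket, so coverage gaps stand out. A minimal sketch, assuming the dimension names and values below (they are hypothetical and should be replaced with your own):

```python
from collections import Counter
from itertools import product

# Hypothetical distribution dimensions; replace with your own design.
DIMENSIONS = {
    "language": ["zh", "en"],
    "color": ["bw", "color"],
    "weather": ["sunny", "rainy", "night"],
}

def coverage_report(samples):
    """Print the sample count for every bucket in the design grid.

    Each sample is a dict with one value per dimension,
    e.g. {"language": "zh", "color": "bw", "weather": "sunny"}.
    """
    counts = Counter(tuple(s[dim] for dim in DIMENSIONS) for s in samples)
    for bucket in product(*DIMENSIONS.values()):
        print(bucket, counts.get(bucket, 0))

coverage_report([
    {"language": "zh", "color": "bw", "weather": "sunny"},
    {"language": "en", "color": "color", "weather": "rainy"},
])
```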
Divide the dataset
The training set is used to train the algorithm model, i.e. to learn the parameter weights of the neural network or other model from input-output pairs. The validation set is used to pick, among these intermediate models, the one that performs best on held-out data. The test set is used to evaluate the effectiveness of the algorithm trained on the training set. Using a real-life analogy:
- The training set is equivalent to a student's textbook: students master knowledge based on its content.
- The validation set is equivalent to a student's homework: from the homework, we can see how well different students are learning and how fast they are progressing.
- The test set is equivalent to the final exam: the exam questions have never been seen before, assessing the student's ability to generalize from what was learned.
After obtaining the original dataset, we need to divide it. A sample must not appear in more than one of the training, validation, and test sets. Normally we adopt a 6:2:2 split, and each split should cover as many of the data scenarios as possible.
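A minimal sketch of such a 6:2:2 split in Python, assuming the samples fit in a list; shuffling before slicing helps each split cover a mix of scenarios, and slicing guarantees every sample lands in exactly one split:

```python
import random

def split_dataset(samples, ratios=(0.6, 0.2, 0.2), seed=42):
    """Shuffle and slice samples into train/validation/test sets."""
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios must sum to 1"
    samples = list(samples)
    random.Random(seed).shuffle(samples)  # fixed seed: reproducible split
    n_train = int(len(samples) * ratios[0])
    n_val = int(len(samples) * ratios[1])
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```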
Annotate data
The tools used to annotate datasets are not the main focus of this article; below are some open-source projects:
Organize dataset format
After annotation, organize the dataset by folder. We suggest not placing more than 1,000 files in a single folder. Then create a metadata file, "meta.csv".
meta.csv
meta.csv is the meta-information file in the HyperAI data format standard. For a detailed explanation of the format, please refer to: Introduction to Data Format Specifications
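The real column set is defined by the HyperAI specification linked above; as a sketch only, the snippet below writes a meta.csv with hypothetical columns (filename, split, label) to show how the metadata file can be generated alongside the folder layout:

```python
import csv
from pathlib import Path

def write_meta(root, rows, name="meta.csv"):
    """Write a metadata CSV. The columns used here are hypothetical;
    follow the HyperAI data format spec for the real schema."""
    with open(Path(root) / name, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["filename", "split", "label"])
        writer.writeheader()
        writer.writerows(rows)

write_meta(".", [
    {"filename": "images/000/0001.jpg", "split": "train", "label": "cat"},
    {"filename": "images/000/0002.jpg", "split": "val",   "label": "dog"},
])
```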
Problems encountered when creating datasets
1. How much data is needed?
Every project is unique. Ideally, the amount of data should be roughly 10 times the number of model parameters. The more complex the task, the more data is needed.
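As a quick sanity check of this rule of thumb, you can count a model's parameters and multiply. A sketch assuming PyTorch and a toy model (the 10x figure is the heuristic from above, not a hard law):

```python
import torch.nn as nn

# Toy model just for the arithmetic; substitute your own.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params} parameters -> target ~{10 * n_params} samples")
```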
2. I already have a dataset. What should I do next?
Don't rush to start. First get to know the existing dataset; you will almost certainly find errors, invalid samples, and messy parts. Fix those first. The quality of the dataset determines every later step of your machine learning project.
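What such a first pass can look like in practice: a minimal sketch using Pillow that walks an image folder, flags files that fail to decode, and reports duplicate file names (the path and extension are assumptions; extend the checks for your own data):

```python
from pathlib import Path
from PIL import Image

def scan_images(root):
    """First-pass audit: report duplicate names, return unreadable files."""
    seen, bad = set(), []
    for path in Path(root).rglob("*.jpg"):
        if path.name in seen:
            print("duplicate name:", path)
        seen.add(path.name)
        try:
            with Image.open(path) as img:
                img.verify()  # cheap integrity check, no full decode
        except Exception as err:
            bad.append((path, err))
    return bad

for path, err in scan_images("dataset/"):
    print("unreadable:", path, err)
```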
3. What if I don't have enough data?
If there is an open-source dataset with a similar scenario, it can be combined with our own data and used together. If the scenario is special, first sort out the business scenario and make good preparations before collecting data.
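If you train with PyTorch, combining a similar open-source dataset with your own can be as simple as concatenating the two. A sketch assuming both datasets yield samples in the same (input, label) format; the dataset class below is a stand-in:

```python
from torch.utils.data import ConcatDataset, Dataset

class ListDataset(Dataset):
    """Stand-in dataset; both sources must share one sample format."""
    def __init__(self, samples):
        self.samples = samples
    def __len__(self):
        return len(self.samples)
    def __getitem__(self, idx):
        return self.samples[idx]

open_source = ListDataset([("img_a.jpg", 0), ("img_b.jpg", 1)])
collected = ListDataset([("img_c.jpg", 0)])
combined = ConcatDataset([open_source, collected])  # train on both
print(len(combined))  # 3
```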