
Internal data management

HyperAI data storage structure

.
└── <username>
    ├── codes
    │   └── <sourcecode-id>
    ├── datasets
    │   ├── <dataset-id>
    │   │   ├── <dataset-version>
    │   │   └── <dataset-version>
    │   └── <dataset-id>
    │       ├── <dataset-version>
    │       └── <dataset-version>
    └── jobs
        ├── <job-id>
        │   ├── logs
        │   └── output
        └── <job-id>
            ├── logs
            └── output

The root directory is the username. Under it there are three directories:

  • /codes: code uploaded by the user
  • /datasets: datasets uploaded by the user
  • /jobs: storage for all of the user's jobs

Under /datasets, each dataset id has its own folder, and the data for each version is stored under it in sequence. Under /jobs, each job gets its own folder containing two subdirectories, /output and /logs, which store the job's output and log content respectively. Data stored in S3 and NFS follows this directory structure.

A few examples:

  1. For the username test and dataset id abcde, the directory of the first version of the dataset is test/datasets/abcde/1
  2. For the username xushanchuan and job id zk812kadf, the job directory is xushanchuan/jobs/zk812kadf, its output directory is xushanchuan/jobs/zk812kadf/output/, and its logs directory is xushanchuan/jobs/zk812kadf/logs/
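
For reference, these path rules can be written down as a few helper functions. The sketch below is only illustrative; the function names are not part of any HyperAI API.

```python
from pathlib import PurePosixPath

# Illustrative helpers that encode the directory layout described above.
# The function names are invented for this sketch and are not a HyperAI API.

def dataset_version_path(username: str, dataset_id: str, version: int) -> str:
    """<username>/datasets/<dataset-id>/<dataset-version>"""
    return str(PurePosixPath(username, "datasets", dataset_id, str(version)))

def job_output_path(username: str, job_id: str) -> str:
    """<username>/jobs/<job-id>/output"""
    return str(PurePosixPath(username, "jobs", job_id, "output"))

def job_logs_path(username: str, job_id: str) -> str:
    """<username>/jobs/<job-id>/logs"""
    return str(PurePosixPath(username, "jobs", job_id, "logs"))

# The two examples from the text:
assert dataset_version_path("test", "abcde", 1) == "test/datasets/abcde/1"
assert job_output_path("xushanchuan", "zk812kadf") == "xushanchuan/jobs/zk812kadf/output"
assert job_logs_path("xushanchuan", "zk812kadf") == "xushanchuan/jobs/zk812kadf/logs"
```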

How HyperAI synchronizes with NFS

  1. When a dataset is uploaded, the data is first stored in a temporary directory for decompression and is synchronized to NFS after decompression.
  2. When a job starts, if the bound directory is /input*, the corresponding directory is bound from NFS into the container in read-only mode.
  3. If a dataset is bound to the /output directory (i.e. the user output directory), the data must first be copied into the volume bound to /output.
  4. While a user job is running, a background process periodically synchronizes data from the working directory to NFS (every 6 minutes by default; this varies between environments); a rough sketch of this loop follows the list.
  5. When the job finishes, the data is synchronized to NFS again.
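
Step 4 can be pictured as a simple background loop. The sketch below only illustrates the behaviour described above; the rsync call, the paths, and the way the interval is configured are assumptions, not the actual implementation.

```python
import subprocess
import time

# Illustrative background sync loop (step 4). The interval matches the
# 6-minute default mentioned above; paths and the use of rsync are assumptions.

SYNC_INTERVAL_SECONDS = 6 * 60                            # default; differs per environment
WORKDIR = "/output/"                                      # hypothetical working directory in the container
NFS_TARGET = "/nfs/xushanchuan/jobs/zk812kadf/output/"    # hypothetical mounted NFS path

def sync_once() -> None:
    # Copy changes from the working directory to NFS without deleting anything.
    subprocess.run(["rsync", "-a", WORKDIR, NFS_TARGET], check=True)

def sync_loop() -> None:
    while True:
        sync_once()
        time.sleep(SYNC_INTERVAL_SECONDS)

if __name__ == "__main__":
    sync_loop()
```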

How to bypass webpage restrictions and upload large-scale datasets

The default upper limit for dataset uploads is 500G (production environment). If you need to upload a larger dataset, or want to upload files without compressing them first, you can simulate the page's dataset interaction and upload the large dataset directly into the cluster's NFS. First we introduce the logic behind uploading data versions in the backend, then how to bypass the upload process.

There are currently two processes for uploading datasets:

  1. [Create a new version](/docs/data-warehouse/dataset-versions/#Create a new version)
  2. [Update existing versions](/docs/data-warehouse/dataset-versions/#Update the dataset version of the data)

Each time a compressed file is uploaded to the backend service, the dataset version to be updated (or the newly created version) is marked as PROCESSING. After the uploaded zip is decompressed there are two possible outcomes, success or failure (a sketch of this logic follows the list below):

  1. Success means there were no issues with the compressed file and it can be used normally. The backend storage service then marks the dataset version as VALID and updates the size of the directory, and the new version with its data size appears on the page.

  2. Failure can have multiple causes, but most often the decompression of the compressed file fails. The dataset version is then marked as INVALID, and a failed dataset version is shown on the page.
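
The state handling above can be summarized as a small sketch. The enum and function below are invented for illustration and do not mirror the real backend code.

```python
import zipfile
from enum import Enum
from pathlib import Path

# Sketch of the version-state logic described above; names are illustrative only.

class VersionState(Enum):
    PROCESSING = "PROCESSING"   # set while the uploaded archive is being decompressed
    VALID = "VALID"             # decompression succeeded, size has been updated
    INVALID = "INVALID"         # decompression failed, version shows as failed

def process_uploaded_archive(archive: Path, version_dir: Path) -> VersionState:
    """Decompress an uploaded zip into the version directory and return the final state."""
    version_dir.mkdir(parents=True, exist_ok=True)
    try:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(version_dir)
    except (zipfile.BadZipFile, OSError):
        return VersionState.INVALID
    # On success the on-disk size is recalculated so the page can display it.
    total_size = sum(f.stat().st_size for f in version_dir.rglob("*") if f.is_file())
    print(f"dataset version size: {total_size} bytes")
    return VersionState.VALID
```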

Given the process above, if you want to copy data directly into NFS to bypass the front-end upload, you can do so as follows:

1. Determine the id of the dataset you want to upload to, then upload a small file as a placeholder

As shown in the figure above, suppose we want to upload data to the dataset with id hY4p8f0sIMH under the HyperAI account. First, upload an arbitrary small file to generate a new version.

Here only a text file named placeholder.txt is uploaded; after the upload succeeds, a new version is generated under the dataset.

2. Upload the large data directly from the backend into the directory of that dataset version

From the section "HyperAI data storage structure" above, we know the directory of the newly created dataset version is hyperai/datasets/hY4p8f0sIMH/1, so we can copy the large dataset directly to this location (see the sketch below). After copying, the copied files are shown on the page, but the size of the dataset has not been updated yet.

Note

Because the NFS setups differ between environments, we do not explain here how to mount the NFS directory on the internal network.
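
Assuming the NFS share is already mounted locally (mounting differs per environment, as noted above), copying the data into place is an ordinary recursive copy. The paths below reuse the example from step 1 and are purely illustrative.

```python
import shutil
from pathlib import Path

# Illustrative copy of a large dataset into the placeholder version's directory.
# Assumes the NFS share is mounted at /mnt/nfs; both paths are examples only.

SOURCE = Path("/data/large-dataset")                       # local copy of the large dataset
TARGET = Path("/mnt/nfs/hyperai/datasets/hY4p8f0sIMH/1")   # version directory created in step 1

shutil.copytree(SOURCE, TARGET, dirs_exist_ok=True)
print("Copy finished; re-upload a small file to refresh the version size (step 3).")
```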

3. Upload a small file again to update the dataset version size

As shown in the figure, uploading another small file via "Upload the dataset to the current directory" causes the size of the dataset version to be recalculated. After the upload succeeds, refresh the page and the dataset size will be correct.