Internal data management
HyperAI data storage structure
```
.
├── <username>
│   ├── codes
│   │   └── <sourcecode-id>
│   ├── datasets
│   │   ├── <dataset-id>
│   │   │   ├── <dataset-version>
│   │   │   └── <dataset-version>
│   │   └── <dataset-id>
│   │       ├── <dataset-version>
│   │       └── <dataset-version>
│   └── jobs
│       ├── <job-id>
│       │   ├── logs
│       │   └── output
│       └── <job-id>
│           ├── logs
│           └── output
```
The root directory is the user's username, which is divided into three directories:
- /codes: code uploaded by the user
- /datasets: datasets uploaded by the user
- /jobs: storage for all of the user's jobs

Under /datasets, a folder is created for each dataset ID, and the data for each version is stored in sequence beneath it. Under /jobs, a separate folder is created for each job; each contains two subdirectories, /output and /logs, which hold the job's output and log content respectively. Data stored in both S3 and NFS follows this directory structure.
A few examples (summarized in the sketch below):
- For the user test, the directory of the first version of the dataset with ID abcde is test/datasets/abcde/1
- For the user xushanchuan, the directory of the job with ID zk812kadf is xushanchuan/jobs/zk812kadf; its output directory is xushanchuan/jobs/zk812kadf/output/ and its logs directory is xushanchuan/jobs/zk812kadf/logs/
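To make the path layout concrete, here is a minimal sketch that builds the paths from the examples above. It is illustrative only; the helper names are hypothetical and not part of HyperAI.

```python
# Illustrative sketch of the path layout described above (hypothetical helpers).
def dataset_dir(username: str, dataset_id: str, version: int) -> str:
    # <username>/datasets/<dataset-id>/<dataset-version>
    return f"{username}/datasets/{dataset_id}/{version}"

def job_dirs(username: str, job_id: str) -> dict:
    # <username>/jobs/<job-id>/ with output/ and logs/ beneath it
    base = f"{username}/jobs/{job_id}"
    return {"job": base, "output": f"{base}/output/", "logs": f"{base}/logs/"}

print(dataset_dir("test", "abcde", 1))        # test/datasets/abcde/1
print(job_dirs("xushanchuan", "zk812kadf"))   # matches the job example above
```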
How HyperAI synchronizes with NFS
- When a dataset is uploaded, the data is first stored in a temporary directory for decompression, and is synchronized to NFS once decompression finishes.
- When a job starts, if the bound directory is /input*, the corresponding directory is bind-mounted from NFS into the container in read-only mode.
- If a dataset is bound to /output (that is, the user output directory), its data must first be copied into the volume bound to /output.
- While the user's job is running, a background process periodically synchronizes data from the working directory to NFS (every 6 minutes by default; the interval differs between environments). A simplified sketch of such a loop is shown after this list.
- When the job finishes executing, the data is also synchronized to NFS.
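As a rough mental model of the periodic synchronization, the sketch below shows a background loop that pushes a working directory to an NFS path with rsync. The paths, interval constant, and rsync flags are assumptions for illustration, not HyperAI's actual implementation.

```python
# Illustrative sketch only: a background loop that periodically syncs the job's
# working directory to its NFS location. Paths and interval are assumptions.
import subprocess
import time

WORK_DIR = "/output/"                                # hypothetical working directory
NFS_DIR = "/nfs/xushanchuan/jobs/zk812kadf/output/"  # hypothetical NFS mount path
SYNC_INTERVAL_SECONDS = 6 * 60                       # default: every 6 minutes

def sync_once() -> None:
    # rsync -a preserves file attributes; --delete mirrors deletions as well
    subprocess.run(["rsync", "-a", "--delete", WORK_DIR, NFS_DIR], check=True)

def sync_loop() -> None:
    while True:
        sync_once()
        time.sleep(SYNC_INTERVAL_SECONDS)
```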
How to bypass the web page restrictions and upload large datasets
The default upper limit for dataset uploads is 500 GB (in the production environment). If you need to upload a larger dataset, or want to upload files without compressing them first, you can mimic the page's dataset interaction flow and upload the large dataset directly to the cluster's NFS. This section first introduces the logic the backend uses when uploading data versions, then explains how to bypass the upload process.
There are currently two processes for uploading datasets:
- [Create a new version](/docs/data-warehouse/dataset-versions/#Create a new version)
- [Update existing versions](/docs/data-warehouse/dataset-versions/#Update the dataset version of the data)
Each time a compressed file is uploaded to the backend service, the dataset version that needs to be updated (or the newly created dataset version) is marked as PROCESSING. After the uploaded zip is decompressed, there are two possible outcomes: success or failure (summarized in the sketch after the list below).
- Success means the compressed file has no problems and can be used normally. The backend storage service marks the dataset version as VALID and updates the size of the directory, and the new version with its data size appears on the page.
- Failure can have several causes, but most of the time decompression of the compressed file failed. The data version is marked as INVALID, and a failed dataset version appears on the page.
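The version lifecycle above can be summarized as a tiny state machine. The sketch below is only an illustration of the PROCESSING/VALID/INVALID flow; the type and function names are hypothetical, not the backend's actual code.

```python
# Sketch of the dataset-version states described above (names are hypothetical).
from enum import Enum

class DatasetVersionStatus(Enum):
    PROCESSING = "PROCESSING"  # archive uploaded, decompression in progress
    VALID = "VALID"            # decompression succeeded, size updated, usable
    INVALID = "INVALID"        # decompression (or another step) failed

def finish_processing(decompress_ok: bool) -> DatasetVersionStatus:
    # The backend marks the version VALID on success, INVALID on failure.
    return DatasetVersionStatus.VALID if decompress_ok else DatasetVersionStatus.INVALID
```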
Given the process above, if you want to copy data directly to NFS and bypass the front-end upload, use the following steps:
1. Determine the ID of the dataset you want to upload to, then upload a small file as a placeholder
As shown in the figure above, suppose we want to upload data to the dataset with ID hY4p8f0sIMH under the hyperai account. First, upload an arbitrary small file to generate a new version. Here only a text file named placeholder.txt is uploaded; once the upload succeeds, a new version is created under the dataset.
2. Copy the large dataset directly from the backend into the directory of that dataset version
From the "HyperAI data storage structure" section above, we know that the directory of the newly created dataset version is hyperai/datasets/hY4p8f0sIMH/1, so we can copy the large dataset directly to this location. After copying, the copied files are shown on the page, but the size of the dataset is not yet updated.
Because the NFS setup differs between environments, how to mount NFS onto an internal network directory is not covered here.
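Assuming the cluster NFS is already mounted locally (for example at a hypothetical /nfs mount point), the copy in this step amounts to something like the following sketch; the source path and mount point are assumptions for illustration.

```python
# Illustrative only: copy a large local dataset into the dataset version's
# directory on an already-mounted NFS share. Paths are assumptions.
import subprocess

LOCAL_DATA = "/data/big-dataset/"                    # hypothetical source directory
NFS_TARGET = "/nfs/hyperai/datasets/hY4p8f0sIMH/1/"  # version directory from step 1

# rsync handles large trees and can resume interrupted copies.
subprocess.run(["rsync", "-a", "--progress", LOCAL_DATA, NFS_TARGET], check=True)
```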
3. Upload the small file again to update the dataset version size
As shown in the figure, uploading another small file via "Upload the dataset to the current directory" causes the space used by the dataset version to be recalculated. After the upload succeeds, refresh the page and you will see the correct dataset size.