Bring Your Own Storage (Part 1)


The Rescale platform provides end-to-end file management backed by storage offerings from the major public cloud vendors. This includes optimized client-side transfer tools as well as in-transit and at-rest encryption. In this model, Rescale controls the object store layout and encryption key management, and users must use Rescale tooling to retrieve the decrypted file content. While this is convenient if you are starting from scratch and looking for a fully managed, secure solution, a common scenario is needing to use the platform with input data that has already been uploaded to the cloud. Another use case that we see is integrating with an existing data pipeline that operates directly on simulation output files sitting in a customer-controlled storage location. For cost and performance reasons, it is important to keep your compute as close to your storage as possible. One of the benefits of Rescale’s platform is that we support a number of different cloud providers and can bring the compute to any cloud storage accounts that you might already be using.
In this post, we will show how customers can transfer input and output files to and from a user-specified location instead of using the default Rescale-managed storage. For this example, we’ll focus on Amazon S3; however, a similar approach can be used with any provider. In the following, we will walk through the setup of a design of experiments (DOE) job where the input and output files reside in a customer-controlled bucket. Let’s assume that the bucket is called “my-simulation-data”, the input files are all prefixed with “input”, and all output files generated by the parameter sweep should be uploaded under the “output” prefix.
This DOE will run the HSDI and Pintle Injector examples for CONVERGE CFD, found on our support page (https://support.rescale.com/customer/en/portal/articles/2579932-converge-examples), in parallel. Normally, the DOE framework is used to change specific numerical values within an input file, but here we will use it to select a completely different input zip for each run.
First, upload the CONVERGE input zips (hsdi_fb_2mm.zip and pintle_injector.zip) to the s3://my-simulation-data/input/ directory in S3.
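If you already have the AWS CLI configured locally with credentials that can write to my-simulation-data, this is one command per file:

aws s3 cp hsdi_fb_2mm.zip s3://my-simulation-data/input/hsdi_fb_2mm.zip
aws s3 cp pintle_injector.zip s3://my-simulation-data/input/pintle_injector.zip
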
Next, create a file locally called inputs.csv that looks like the following:

s3_input,s3_output
s3://my-simulation-data/input/hsdi_fb_2mm.zip,s3://my-simulation-data/output/hsdi/
s3://my-simulation-data/input/pintle_injector.zip,s3://my-simulation-data/output/pintle/

In order to give the Rescale compute nodes access to the bucket, an IAM policy needs to be created that provides read access to the input directory and full access to the output directory:

{
   "Version": "2012-10-17",
   "Statement": [
       {
           "Effect": "Allow",
           "Action": [
               "s3:Get*",
               "s3:List*"
           ],
           "Resource": [
               "arn:aws:s3:::my-simulation-data/input",
               "arn:aws:s3:::my-simulation-data/input/*"
           ]
       },
       {
           "Effect": "Allow",
           "Action": [
               "s3:*"
           ],
           "Resource": [
               "arn:aws:s3:::my-simulation-data/output",
               "arn:aws:s3:::my-simulation-data/output/*"
           ]
       }
   ]
}
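
If you manage IAM from the command line, this policy can be created with the AWS CLI. A minimal sketch, assuming the JSON above has been saved locally as my-simulation-data-policy.json (both the file name and the policy name here are placeholders):

aws iam create-policy \
    --policy-name rescale-simulation-data \
    --policy-document file://my-simulation-data-policy.json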

Note that another way to accomplish this is to set up cross-account access (https://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_cross-account-with-roles.html). This is the preferable way to configure access if all compute nodes will run in AWS. However, the approach above works regardless of where the compute nodes are executing.
Now, attach this policy to an IAM user and generate an access key and secret key. These credentials should then be placed into an AWS config file saved locally:

[default]
aws_access_key_id=XXXXXXX
aws_secret_access_key=XXXXXXXXX
region=us-east-1

Save the above to a file called config.
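The IAM user setup can also be scripted with the AWS CLI. A sketch, assuming the policy above was created under the name rescale-simulation-data, and using a placeholder account ID and user name:

# Create a dedicated IAM user for Rescale jobs (the name is a placeholder)
aws iam create-user --user-name rescale-byos

# Attach the bucket policy (replace 123456789012 with your AWS account ID)
aws iam attach-user-policy \
    --user-name rescale-byos \
    --policy-arn arn:aws:iam::123456789012:policy/rescale-simulation-data

# Print an access key / secret key pair to copy into the config file
aws iam create-access-key --user-name rescale-byos
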
The last file that needs to be created locally is the run script template. We will reference the s3_input and s3_output variables from the inputs.csv created above in a shell script template file that will be executed for each run. Create a file called run.sh.template that looks like:

#!/bin/bash
# Point the AWS CLI at the credentials file uploaded with the job
export AWS_CONFIG_FILE=config
# Download and unpack this run's input archive
aws s3 cp ${s3_input} .
unzip *.zip
converge
# Upload everything the run produced, except the credentials file
aws s3 cp . ${s3_output} --recursive --exclude config
# Clean up so the outputs are not also uploaded to Rescale storage
rm -rf *

There are a couple of things to point out in the above script. Normally, the Rescale platform will automatically unarchive zip files; however, in this case we need to handle that ourselves since we are bypassing Rescale storage for our inputs. The rm -rf * at the end of the script deletes all of the output files after uploading them to the user-specified S3 location. If we omit this step, then the output files will also be uploaded to Rescale storage after the script exits.
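For illustration, after the platform substitutes the values from the first row of inputs.csv, the rendered run.sh would look roughly like this (comments omitted; the substitution itself is handled by the DOE template engine, so this is just a sketch of the result):

#!/bin/bash
export AWS_CONFIG_FILE=config
aws s3 cp s3://my-simulation-data/input/hsdi_fb_2mm.zip .
unzip *.zip
converge
aws s3 cp . s3://my-simulation-data/output/hsdi/ --recursive --exclude config
rm -rf *
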
Now that the necessary files have been created locally, we can configure a new DOE job on the platform that references them. From the New Job page, change the Job Type to DOE and configure the job as follows:

  1. Input Files: Upload config
  2. Parallel Settings: Select “Use a run definition file” and upload inputs.csv
  3. Templates: Upload run.sh.template. Use run.sh as the template name
  4. Software: Select converge 2.3.X and set the command to run.sh
  5. Hardware: Onyx, 8 cores per slot, 2 task slots

Submit the job. When it completes, all of the output files can be found in the s3://my-simulation-data/output/hsdi/ and s3://my-simulation-data/output/pintle/ directories.
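To spot-check the results from your own machine, you can list the output locations with the AWS CLI (again assuming locally configured credentials with read access to the bucket):

aws s3 ls s3://my-simulation-data/output/hsdi/ --recursive
aws s3 ls s3://my-simulation-data/output/pintle/ --recursive
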
In this DOE setup, the ancillary setup data (e.g., the AWS config file, the CSV file, and the run script template) is encrypted and stored in Rescale-managed storage. The meat of the job, the input and output files, is stored in the user-specified bucket.
We do recognize that the above setup requires a bit of manual configuration. One of the items on our roadmap is better integration with customer-provided storage accounts. Stay tuned for details!
