How to Port Your Experiments Between Chameleon Sites

Best practices for using resources across multiple sites

Chameleon's resources are distributed across several sites, large and small. There are two main reasons you might want to move your work to a different site:
First, each site has different hardware available. If you developed an experiment to use the gpu_a100_pcie nodes at UC, you may want to see how it performs on the gpu_mi100 nodes at TACC instead.
Second, availability: many kinds of resources are available at multiple sites, even if the “node_types” don’t match exactly. Nodes with Intel Skylake or Cascade Lake generation CPUs, or with InfiniBand, are available at both UC and TACC, and moving your work lets you take advantage of them when one site is very busy or has an outage.

All of your work is fundamentally portable between Chameleon sites. Because all of the sites share a common API and federated authentication, all you need to do is pick a site, move the data that your experiment needs, and possibly update some names in your scripts.

Finding Resources at each Site:

The first question you might ask is, how can you find similar resources between sites? 
As always, the Chameleon Resource Browser is the “source of truth” for what resources are available at each site. Most “node_types” are unique to a given site, but many are similar, differing only by CPU generation or GPU model.
You can drill down more finely by using the advanced filters instead of node types, as documented here: Resource discovery — Chameleon Cloud Documentation

Glance Images:

Chameleon’s bare metal disk images will work across all Chameleon sites without any changes. In addition, good news: all of our basic “supported images” are already available at each site. If this is what you’re using, you’re already done!

However, if you’re using a custom image, e.g. created with cc-snapshot, you’ll need to copy it between sites yourself. To do so, you’ll need to download it from the first site, then upload it to the second. See the following docs pages for more:

Worked example with the CLI

This example assumes that you have downloaded openrc files from UC and TACC, and changed names as needed. If you have a slow internet connection, you can run this from a Chameleon node to speed things up!

#!/bin/bash

# download from UC
source uc-openrc.sh
openstack image save my-snapshot --file my-snapshot.img

# upload to TACC
source tacc-openrc.sh
openstack image create my-snapshot \
  --container-format bare \
  --disk-format qcow2 \
  --file my-snapshot.img
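
Before deleting the local file, it can be worth confirming that the transfer is intact. The sketch below is one way to do that. It assumes GNU md5sum is installed and that the image's "checksum" field holds an MD5 digest of the image data, which is the usual Glance behavior but worth confirming on your site. verify_image_checksum is a hypothetical helper name, not part of the OpenStack CLI.

```shell
#!/bin/bash
# Sketch: compare a local image file against the checksum Glance reports.
# Assumes the openrc file for the site you want to check is already sourced.
# verify_image_checksum is a hypothetical helper, not an OpenStack command.

verify_image_checksum() {
  local image_name="$1" image_file="$2"
  local remote_sum local_sum
  # Assumption: the "checksum" field is an MD5 digest of the image data
  remote_sum="$(openstack image show "$image_name" -f value -c checksum)"
  local_sum="$(md5sum "$image_file" | awk '{print $1}')"
  if [ "$remote_sum" = "$local_sum" ]; then
    echo "OK: $image_name matches $image_file"
  else
    echo "MISMATCH: $image_name ($remote_sum) vs $image_file ($local_sum)" >&2
    return 1
  fi
}
```

For example, after sourcing uc-openrc.sh and downloading, run verify_image_checksum my-snapshot my-snapshot.img. After the upload, source tacc-openrc.sh and run it again to check the copy at TACC.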

Persistent Data

Your experiment might depend on, or generate, large amounts of data. We provide several tools to help you with this. In particular, CHI@UC and CHI@TACC each host an object store and a shared filesystem that you can use.

Moving Object Store Data

The object store at either site can be accessed over the WAN from anywhere, so why would you want to move this data? First, you’ll see improved performance if you access an object store “local” to your instances, instead of further away. Second, in the case of an outage, having a copy of your data elsewhere will keep it available for your use.

As with Glance images, you’ll need to download the data, then upload it. See the docs at: 

If you have a lot of data, whether in large files or in many small files, it will be faster to run the sync from a Chameleon instance instead of your local machine. This can be done using the CLI, or any Swift-compatible utility, such as rclone.

Example:

#!/bin/bash

# download from UC
source uc-openrc.sh
openstack object save my-container my-data.tar.gz

# upload to TACC
# creates object my-data.tar.gz in container my-container, from source file my-data.tar.gz
source tacc-openrc.sh
# make sure the container exists at TACC before uploading into it
openstack container create my-container
openstack object create my-container my-data.tar.gz
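
If a container holds many objects, the same download-then-upload pattern can be looped. The following is a minimal sketch rather than a tested tool: copy_container is a hypothetical helper, and it assumes flat object names (no "/" pseudo-folders), enough local disk for a full copy, and objects small enough to upload in a single request.

```shell
#!/bin/bash
# Sketch: copy every object in a container from one site to another.
# copy_container is a hypothetical helper; arguments are the two openrc
# files and the container name.

copy_container() {
  local src_rc="$1" dst_rc="$2" container="$3"
  local workdir name
  workdir="$(mktemp -d)" || return 1
  (
    # subshell: sourced credentials do not leak into the caller's environment
    set -o pipefail
    source "$src_rc" || exit 1
    # download each object into the scratch directory
    openstack object list "$container" -f value -c Name |
      while IFS= read -r name; do
        openstack object save --file "$workdir/$name" "$container" "$name" \
          < /dev/null || exit 1
      done || exit 1
    source "$dst_rc" || exit 1
    openstack container create "$container" || exit 1
    # upload from inside the directory so object names carry no path prefix
    cd "$workdir" || exit 1
    for name in *; do
      [ -f "$name" ] || continue
      openstack object create "$container" "$name" || exit 1
    done
  )
  local status=$?
  rm -rf "$workdir"
  return "$status"
}
```

A call such as copy_container uc-openrc.sh tacc-openrc.sh my-container would mirror the worked example above for the whole container.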

Moving Filesystem Data

Unlike the object store, the shared filesystem is site-local. Data stored in a filesystem share can only be accessed by instances at that site that mount it. Therefore, if you need to move it to a different site, the easiest way is to use an instance to copy the data into the object store first, then follow the object store instructions above to move it to the other site.
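
As a rough illustration of the first step, the sketch below packs a mounted share into a single archive and records a checksum so the copy can be verified at the other site. archive_share is a hypothetical helper, /mnt/my-share is a placeholder mount point, and GNU tar and sha256sum are assumed to be installed.

```shell
#!/bin/bash
# Sketch: pack the contents of a mounted share into one archive plus a
# checksum file. archive_share is a hypothetical helper; write the archive
# somewhere outside the share so tar does not try to archive its own output.

archive_share() {
  local mount_point="$1" archive="$2"
  local dir base
  dir="$(dirname "$archive")"
  base="$(basename "$archive")"
  # -C stores paths relative to the mount point, so the archive can be
  # unpacked into any directory at the destination site
  tar -czf "$archive" -C "$mount_point" . || return 1
  # record a SHA-256 digest next to the archive for checking after transfer
  ( cd "$dir" && sha256sum "$base" > "$base.sha256" )
}

# Example usage (not run here):
#   archive_share /mnt/my-share share-data.tar.gz
#   openstack object create my-container share-data.tar.gz share-data.tar.gz.sha256
# At the other site, after downloading both objects:
#   sha256sum -c share-data.tar.gz.sha256
```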

I just have a running instance, where do I start?

If you have a lot of data or state tied up in a running instance, you’ll first need to move it to a more persistent form before moving it to a different site. We provide a utility, cc-snapshot, to make a new disk image out of a running bare metal instance. However, this comes with a few caveats!

  1. Snapshots are unreliable above 10GB, and snapshots above 20GB cannot be used to launch new instances. Put large datasets into the object store instead.
  2. Any data mounted from the shared filesystem is site-local. Copy it into the object store as well if you need to use it at a different site.
  3. When you make your snapshot, exclude any copies of this data.
  4. When you launch a new instance from the snapshot, download or mount the data as needed.
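
To get a rough idea of whether a snapshot will stay under the limits listed above, you can check disk usage before running cc-snapshot. This is only a sketch: the helper names are hypothetical, "GB" is treated as GiB (an assumption), and used disk space is just a rough proxy for the final image size.

```shell
#!/bin/bash
# Sketch: rough pre-flight check against the 10GB / 20GB snapshot limits.
# Both helpers are hypothetical; thresholds treat "GB" as GiB (assumption).

snapshot_size_verdict() {
  local used_kb="$1"
  local kb_per_gb=$((1024 * 1024))
  if [ "$used_kb" -gt $((20 * kb_per_gb)) ]; then
    echo "too-large"      # above 20GB: cannot be used to launch instances
  elif [ "$used_kb" -gt $((10 * kb_per_gb)) ]; then
    echo "unreliable"     # above 10GB
  else
    echo "ok"
  fi
}

estimate_snapshot_kb() {
  # Used space on the root filesystem, minus a directory you plan to exclude.
  # Only meaningful if that directory lives on the root filesystem.
  local excluded="$1" used_kb excluded_kb
  used_kb="$(df -Pk / | awk 'NR==2 {print $3}')"
  excluded_kb="$(du -sk "$excluded" 2>/dev/null | awk '{print $1}')"
  echo $(( used_kb - ${excluded_kb:-0} ))
}

# Example usage (not run here):
#   snapshot_size_verdict "$(estimate_snapshot_kb /scratch/my-data)"
```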

CLI Example with the object store!

This assumes your “big dataset” is at “/scratch/my-data”
1. On the instance you’re “saving”:
#!/bin/bash

openstack container create my-container
tar -czf my-data.tar.gz /scratch/my-data
openstack object create my-container my-data.tar.gz

# exclude the tar file and the dataset when making the snapshot
cc-snapshot -e /scratch/my-data -e my-data.tar.gz my-snapshot

 

2. On the new instance you booted from the new image “my-snapshot”:
mkdir -p /scratch
openstack object save my-container my-data.tar.gz
# tar dropped the leading "/" when archiving, so extract relative to "/"
tar -xzf my-data.tar.gz -C /

Running your work at the new site

Finally, after moving your data to the new site, you can run your experiments as usual, keeping the following in mind:

  • When you create leases and reservations, be sure to update the hardware selectors if needed.
  • When launching instances from custom disk images, refer to the name you gave the image when copying it (if different from the original).
  • If you have scripts or Jupyter notebooks that drive your experiment programmatically, update them to authenticate to the new site and to reference the new names or UUIDs, as above.
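
One way to keep scripts portable is to take the site as a parameter and look up names and IDs at run time instead of hard-coding them. The sketch below assumes openrc files named "<site>-openrc.sh", as in the earlier examples. use_site, image_id_at_site, and the OPENRC_DIR variable are hypothetical names introduced for illustration.

```shell
#!/bin/bash
# Sketch: select a site by name and look up the ID an image has there.
# use_site, image_id_at_site, and OPENRC_DIR are hypothetical names.

use_site() {
  # OPENRC_DIR defaults to the current directory
  local rc="${OPENRC_DIR:-.}/$1-openrc.sh"
  if [ ! -f "$rc" ]; then
    echo "no openrc file for site '$1' at $rc" >&2
    return 1
  fi
  source "$rc"
}

image_id_at_site() {
  local site="$1" image_name="$2"
  # subshell keeps the sourced credentials out of the caller's environment
  ( use_site "$site" && openstack image show "$image_name" -f value -c id )
}

# Example usage (not run here):
#   TACC_IMAGE_ID="$(image_id_at_site tacc my-snapshot)"
```

Because image UUIDs differ between sites even when names match, resolving the ID at run time avoids editing scripts each time you switch.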
