Train ML models with MLFlow and Ray
In this tutorial, we explore some of the infrastructure and platform requirements for large model training, and for supporting the training of many models by many teams. We focus specifically on experiment tracking with MLFlow and on scheduling training jobs on a Ray cluster.
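To give a flavor of what these two tools do, here is a minimal sketch (not part of the tutorial materials) that logs a toy training run to an MLFlow tracking server from inside a Ray task. It assumes mlflow and ray are installed, and the tracking URI, experiment name, and "training" function are all made up for illustration:

```python
# Hypothetical sketch: log a toy training run to MLFlow from inside a Ray task.
# Assumes mlflow and ray are installed, and an MLFlow tracking server is
# reachable at the (made-up) address below.
import mlflow
import ray

@ray.remote
def train_job(lr: float) -> float:
    mlflow.set_tracking_uri("http://127.0.0.1:5000")  # hypothetical server address
    mlflow.set_experiment("demo-experiment")          # hypothetical experiment name
    with mlflow.start_run():
        mlflow.log_param("lr", lr)
        loss = 1.0 / (1.0 + lr)  # stand-in for a real training loop
        mlflow.log_metric("loss", loss)
    return loss

if __name__ == "__main__":
    ray.init()  # start a local Ray runtime, or connect to an existing cluster
    # schedule two "training jobs" on the Ray cluster and collect their results
    print(ray.get([train_job.remote(lr) for lr in (0.01, 0.1)]))
```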
Follow along at Train ML models with MLFlow and Ray.
Note: this tutorial requires advance reservation of specific hardware! You will need a node with 2 GPUs suitable for model training. You should reserve a 3-hour block for the MLFlow section and a 3-hour block for the Ray section. (They are designed to run independently.)
You can use either:

- a gpu_mi100 node at CHI@TACC (but make sure the one you select has 2 GPUs), or
- a compute_liqid node at CHI@TACC (again, make sure the one you select has 2 GPUs)
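As a rough sketch of how such a reservation could be made programmatically (this is not part of the tutorial materials; it uses the legacy python-chi API, whose call names may differ across versions, and the project and lease names are hypothetical):

```python
# Hypothetical sketch: reserve one GPU node on Chameleon with the legacy
# python-chi API. Call names may differ across python-chi versions.
import chi
from chi import lease

chi.use_site("CHI@TACC")               # the site named above
chi.set("project_name", "CHI-XXXXXX")  # hypothetical project ID; use your own

reservations = []
# Reserve one gpu_mi100 node. Check the Chameleon hardware browser to
# confirm the node you select actually has 2 GPUs.
lease.add_node_reservation(reservations, count=1, node_type="gpu_mi100")

l = lease.create_lease("mltrain-lease", reservations)  # hypothetical lease name
lease.wait_for_active(l["id"])         # block until the lease becomes active
```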
This material is based upon work supported by the National Science Foundation under Grant No. 2230079.
Launching this artifact will open it within Chameleon’s shared Jupyter experiment environment, which is accessible to all Chameleon users with an active allocation.
Download archive
Download an archive containing the files of this artifact.
Download with git
Clone the git repository for this artifact, and check out the version's commit:

git clone https://github.com/teaching-on-testbeds/mltrain-chi
cd mltrain-chi
git checkout 48a9d465d9ae227473b1f54e3b6250768dd42752
Submit feedback through GitHub issues