Train ML models with MLFlow and Ray
In this tutorial, we explore some of the infrastructure and platform requirements for large model training, and for supporting the training of many models by many teams. We focus specifically on experiment tracking with MLFlow and on scaling and scheduling training jobs with Ray.
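The MLFlow half of the tutorial centers on experiment tracking. As a rough illustration of what that involves (a minimal sketch, not the tutorial's actual code; the tracking URI, experiment name, and metric values are placeholder assumptions), a training script logs its parameters and metrics to a tracking server like this:

```python
import mlflow

# Placeholder address; assume an MLFlow tracking server is reachable here
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("demo-experiment")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    for epoch in range(3):
        # Dummy values; a real training loop would log its actual loss
        mlflow.log_metric("loss", 1.0 / (epoch + 1), step=epoch)
```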
Note: this tutorial requires advance reservation of specific hardware! You will need a node with 2 GPUs suitable for model training. You should reserve a 3-hour block for the MLFlow section and a 3-hour block for the Ray section. (They are designed to run independently.)
You can use either:

- a `gpu_mi100` node at CHI@TACC (make sure the one you select has 2 GPUs), or
- a `compute_liqid` node at CHI@TACC (again, make sure the one you select has 2 GPUs)
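The 2-GPU requirement matters most for the Ray section, which schedules training work across the node's GPUs. As a loose sketch of the idea (with assumed names and values, not the tutorial's code), a Ray task can declare how many GPUs it needs, and Ray's scheduler places tasks on available GPUs:

```python
import ray

ray.init()  # start (or connect to) a Ray cluster on this node

@ray.remote(num_gpus=1)  # each task claims one of the node's two GPUs
def train_one(lr):
    # Placeholder for a real training job
    return {"lr": lr, "status": "done"}

# With 2 GPUs, these two tasks can run concurrently
print(ray.get([train_one.remote(0.1), train_one.remote(0.01)]))
```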
This material is based upon work supported by the National Science Foundation under Grant No. 2230079.
Launching this artifact will open it within Chameleon’s shared Jupyter experiment environment, which is accessible to all Chameleon users with an active allocation.
Download archive
Download an archive containing the files of this artifact.
Download with git
Clone the git repository for this artifact, and check out the version's commit:
git clone https://github.com/teaching-on-testbeds/mltrain-chi
cd mltrain-chi
git checkout 098979977bd4579ccbaa6cfa78989bed6f0582d5
Submit feedback through GitHub issues