Train ML models with MLFlow and Ray

In this tutorial, we explore some of the infrastructure and platform requirements for training large models, and for supporting the training of many models by many teams. We focus specifically on

  • experiment tracking (using MLFlow, sketched below), and
  • scheduling training jobs on a GPU cluster (using Ray, also sketched below)
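
To give a flavor of the experiment tracking part, here is a minimal sketch of logging a training run to an MLFlow tracking server. This is illustrative only and not taken from the tutorial materials; the tracking URI, experiment name, and logged values are placeholders.

import mlflow

# Hypothetical tracking server address; the tutorial provisions its own.
mlflow.set_tracking_uri("http://127.0.0.1:8000")
mlflow.set_experiment("demo-experiment")

with mlflow.start_run():
    # Log hyperparameters once, then metrics per epoch.
    mlflow.log_param("lr", 1e-3)
    mlflow.log_param("batch_size", 64)
    for epoch in range(3):
        # In a real run, these values would come from your training loop.
        mlflow.log_metric("val_loss", 1.0 / (epoch + 1), step=epoch)

And for the scheduling part, a similarly minimal sketch of submitting a training job to a Ray cluster through its job submission API. The dashboard address and entrypoint script are placeholders, and the GPU request assumes a Ray release recent enough to support entrypoint_num_gpus.

from ray.job_submission import JobSubmissionClient

# Hypothetical Ray dashboard address; the tutorial brings up its own cluster.
client = JobSubmissionClient("http://127.0.0.1:8265")
job_id = client.submit_job(
    entrypoint="python train.py",        # placeholder training script
    runtime_env={"working_dir": "./"},   # ship local code to the cluster
    entrypoint_num_gpus=1,               # ask the scheduler for one GPU
)
print(job_id)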

Follow along at Train ML models with MLFlow and Ray.

Note: this tutorial requires advance reservation of specific hardware! You will need a node with 2 GPUs suitable for model training. You should reserve a 3-hour block for the MLFlow section and a 3-hour block for the Ray section. (They are designed to run independently.)

You can use either of the following (a reservation sketch follows this list):

  • a gpu_mi100 node at CHI@TACC (make sure the one you select has 2 GPUs), or
  • a compute_liqid node at CHI@TACC (again, make sure the one you select has 2 GPUs).
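
As a rough sketch of what reserving such a node looks like with the python-chi client (function names and defaults vary across python-chi versions, so treat this as an assumption-laden outline rather than the tutorial's own reservation steps):

import chi
from chi import lease

# Assumes your Chameleon project is already configured for python-chi.
chi.use_site("CHI@TACC")

# Reserve one gpu_mi100 node for a 3-hour block (node type and lease
# name here are placeholders).
reservations = []
lease.add_node_reservation(reservations, node_type="gpu_mi100", count=1)
start_date, end_date = lease.lease_duration(hours=3)
l = lease.create_lease("mltrain-lease", reservations,
                       start_date=start_date, end_date=end_date)
lease.wait_for_active(l["id"])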

This material is based upon work supported by the National Science Foundation under Grant No. 2230079.


Launch on Chameleon

Launching this artifact will open it within Chameleon’s shared Jupyter experiment environment, which is accessible to all Chameleon users with an active allocation.

Download Archive

Download an archive containing the files of this artifact.

Download with git

Clone the git repository for this artifact, and check out the commit for this version:

git clone https://github.com/teaching-on-testbeds/mltrain-chi
cd mltrain-chi
git checkout 48a9d465d9ae227473b1f54e3b6250768dd42752

Feedback

Submit feedback through GitHub issues
