Model optimizations for serving machine learning models
In this tutorial, we explore some model-level optimizations for model serving:
- graph optimizations
- quantization
- and hardware-specific execution providers, which replace generic implementations of operations in the graph with optimized, hardware-specific implementations
and we will see how these optimizations affect the throughput and inference time of a model.
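To build some intuition for the quantization step before we get to the testbed, here is a minimal, self-contained sketch of post-training linear quantization of a weight tensor. This is a toy illustration of the general idea (mapping float32 values onto int8 with a scale factor), not the actual quantization API used in the tutorial; the function names and the symmetric per-tensor scheme are illustrative choices.

```python
import numpy as np

def quantize_int8(weights):
    """Map float32 weights onto int8 using a single symmetric scale factor."""
    scale = np.abs(weights).max() / 127.0  # one scale for the whole tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 values from the int8 representation."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# int8 storage is 4x smaller than float32, at the cost of a small
# per-element reconstruction error (bounded by half the scale)
print(w.nbytes // q.nbytes)                    # 4
print(float(np.abs(w - w_hat).max()) < scale)  # True
```

In practice, serving frameworks also quantize activations and may use per-channel scales, but the storage and bandwidth savings shown here are the core reason quantization improves inference throughput.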
Note: this tutorial requires advance reservation of specific hardware! You can use either:
- The compute_liqid nodes at CHI@TACC, which have one or two NVIDIA A100 40GB GPUs and an AMD EPYC 7763 CPU.
- The compute_gigaio nodes at CHI@UC, which have an NVIDIA A100 80GB GPU and an AMD EPYC 7763 CPU.
and you will need a 3-hour block of time.
This material is based upon work supported by the National Science Foundation under Grant No. 2230079.
Launching this artifact will open it within Chameleon’s shared Jupyter experiment environment, which is accessible to all Chameleon users with an active allocation.
Download an archive containing the files of this artifact.
Download with git
Clone the git repository for this artifact, and check out the version's commit:
git clone https://github.com/teaching-on-testbeds/serve-model-chi.git
# cd into the created directory
cd serve-model-chi
git checkout 6e71162a8b1a560dea3aee869a197c45e6fa6ea4
Submit feedback through GitHub issues