Model optimizations for serving machine learning models

In this tutorial, we explore some model-level optimizations for model serving:

  • graph optimizations
  • quantization
  • and hardware-specific execution providers, which replace generic implementations of operations in the graph with implementations optimized for the target hardware

and we will see how these affect the throughput and inference time of a model.
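As a preview of what these options look like in practice, here is a minimal sketch using ONNX Runtime, one common serving runtime that exposes the "execution provider" abstraction described above. This is an illustration under assumptions, not the tutorial's exact code: it assumes the model has already been exported to ONNX format, and the file names are placeholders.

import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType

# Graph optimizations (e.g. constant folding, operator fusion) are
# applied when the inference session is created
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# Prefer the CUDA execution provider on an NVIDIA GPU, falling back
# to the generic CPU implementation if it is unavailable
session = ort.InferenceSession(
    "model.onnx",  # placeholder: path to an exported ONNX model
    sess_options=sess_options,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Dynamic post-training quantization: store weights as 8-bit integers
quantize_dynamic("model.onnx", "model_int8.onnx", weight_type=QuantType.QInt8)

Measuring the throughput and inference time of the model before and after each of these changes is what the tutorial walks through.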

Note: this tutorial requires advance reservation of specific hardware! You can use either:

  • The compute_liqid nodes at CHI@TACC, which have one or two NVIDIA A100 40GB GPUs, and an AMD EPYC 7763 CPU.
  • The compute_gigaio nodes at CHI@UC, which have an NVIDIA A100 80GB GPU, and an AMD EPYC 7763 CPU.

and you will need a 3-hour block of time.


This material is based upon work supported by the National Science Foundation under Grant No. 2230079.



Launch on Chameleon

Launching this artifact will open it within Chameleon’s shared Jupyter experiment environment, which is accessible to all Chameleon users with an active allocation.

Download Archive

Download an archive containing the files of this artifact.

Download with git

Clone the git repository for this artifact, and check out this version's commit:

git clone https://github.com/teaching-on-testbeds/serve-model-chi.git
cd serve-model-chi
git checkout 6e71162a8b1a560dea3aee869a197c45e6fa6ea4

Feedback

Submit feedback through GitHub issues
