Model optimizations for serving machine learning models

In this tutorial, we explore some model-level optimizations for model serving:

  • graph optimizations
  • quantization
  • and hardware-specific execution providers, which replace generic implementations of operations in the graph with implementations optimized for the target hardware

and we will see how these affect the throughput and inference time of a model.
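As a preview of what these options look like in practice, here is a minimal sketch using ONNX Runtime, one common serving runtime that exposes the "execution provider" abstraction described above. This is an illustration under assumptions, not the tutorial's exact code: it assumes the model has already been exported to ONNX format, and the file names are placeholders.

import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType

# Graph optimizations (e.g. constant folding, operator fusion) are
# applied when the inference session is created
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# Prefer the CUDA execution provider on an NVIDIA GPU, falling back
# to the generic CPU implementation if it is unavailable
session = ort.InferenceSession(
    "model.onnx",  # placeholder: path to an exported ONNX model
    sess_options=sess_options,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Dynamic post-training quantization: store weights as 8-bit integers
quantize_dynamic("model.onnx", "model_int8.onnx", weight_type=QuantType.QInt8)

Measuring the throughput and inference time of the model before and after each of these changes is what the tutorial walks through.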

Note: this tutorial requires advance reservation of specific hardware! You can use either:

  • The compute_liqid nodes at CHI@TACC, which have one or two NVIDIA A100 40GB GPUs, and an AMD EPYC 7763 CPU.
  • The compute_gigaio nodes at CHI@UC, which have an NVIDIA A100 80GB GPU, and an AMD EPYC 7763 CPU.

and you will need a 3-hour block of time.


This material is based upon work supported by the National Science Foundation under Grant No. 2230079.



Launch on Chameleon

Launching this artifact will open it within Chameleon’s shared Jupyter experiment environment, which is accessible to all Chameleon users with an active allocation.

Download Archive

Download an archive containing the files of this artifact.

Download with git

Clone the git repository for this artifact, and check out this version's commit:

git clone https://github.com/teaching-on-testbeds/serve-model-chi.git
cd serve-model-chi
git checkout 6e71162a8b1a560dea3aee869a197c45e6fa6ea4

Feedback

Submit feedback through GitHub issues
