System optimizations for serving machine learning models

In this tutorial, we explore some system-level optimizations for model serving. You will:

  • learn how to wrap a model in an HTTP endpoint using FastAPI (see the sketch after this list)
  • explore system-level optimizations for model serving in Triton Inference Server, including concurrent model execution and dynamic batching (see the configuration sketch after this list)
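
As a preview of the first item, here is a minimal sketch of wrapping a model behind an HTTP prediction endpoint with FastAPI. The request schema and the placeholder prediction are hypothetical, not the tutorial's actual model:

# app.py -- minimal FastAPI sketch (hypothetical schema and model call)
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    features: list[float]  # hypothetical input format

@app.post("/predict")
def predict(req: PredictRequest):
    # A real service would call the loaded model here,
    # e.g. model.predict([req.features]); this returns a placeholder.
    return {"prediction": sum(req.features)}

# Serve with: uvicorn app:app --host 0.0.0.0 --port 8000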

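For the second item, concurrent model execution and dynamic batching in Triton Inference Server are controlled through the model's config.pbtxt. A hedged sketch follows; the model name, backend, and batch sizes are placeholder values, not the tutorial's actual configuration:

name: "example_model"          # hypothetical; must match the model repository directory
platform: "onnxruntime_onnx"   # placeholder backend
max_batch_size: 32

# Dynamic batching: Triton groups individual requests into batches server-side.
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}

# Concurrency: run two instances of the model on each available GPU.
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
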
Follow along at System optimizations for serving machine learning models.

Note: this tutorial requires advance reservation of specific hardware! You should reserve:

  • A gpu_p100 node at CHI@TACC, which has two NVIDIA P100 GPUs

and you will need a 3-hour block of time.
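
Reservations can be made from the Chameleon GUI or from the CLI. As a hedged illustration only (the lease name and dates are placeholders, and the flags assume the OpenStack Blazar reservation plugin that Chameleon uses):

# Hypothetical example: reserve one gpu_p100 node for a 3-hour block
openstack reservation lease create \
  --start-date "2025-03-17 13:00" \
  --end-date "2025-03-17 16:00" \
  --reservation min=1,max=1,resource_type=physical:host,resource_properties='["=", "$node_type", "gpu_p100"]' \
  my_p100_lease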


This material is based upon work supported by the National Science Foundation under Grant No. 2230079.


Launch on Chameleon

Launching this artifact will open it within Chameleon’s shared Jupyter experiment environment, which is accessible to all Chameleon users with an active allocation.

Download Archive

Download an archive containing the files of this artifact.

Download with git

Clone the git repository for this artifact, and check out the version's commit:

git clone https://github.com/teaching-on-testbeds/serve-system-chi/
cd serve-system-chi
git checkout 9219d86fdd9f11d6aa8a61c0c0cd72f338f9dd5c

Feedback

Submit feedback through GitHub issues
