This blog feature explores 4th year University of Michigan PhD student Peifeng Yu’s research on hyperparameter tuning, presented earlier this month at MLSys21. Learn more about Yu, the hyperparameter tuning engine, and how it can improve your deep learning model training process.
On the current research:
Hyperparameters (HP) are parameters that govern the entire deep learning model training process, e.g. the number of layers, learning rates, or other parameters controlling the model architecture and/or training process. The effectiveness of deep learning models is highly sensitive to hyperparameters. For example, if the learning rate is too large, the model may converge too quickly to a suboptimal solution, whereas a learning rate that is too small can cause the process to get stuck.
Hyperparameter tuning is an optimization technique whereby we look for the set of hyperparameters most likely to produce the highest validation accuracy. This can be slow, as one tuning job contains a large group of training trials, each with its own hyperparameter configuration, and each configuration is evaluated by training the model. Distributed computation is commonly used to speed up the tuning process.
Hyperparameter tuning algorithms today usually consist of two components: config generation and evaluation strategies. The algorithm directly interacts with the cluster and the execution logic is implicitly defined by the evaluation strategy.
During a tuning job:
A set of hyperparameter configurations is generated
Each configuration is submitted to the cluster as a training trial and evaluated
The cluster reports back the results
The generation process is updated with the results to guide the future config generation
This method creates unnecessary slowdowns in the tuning process, because it lacks a complementary execution engine that can efficiently leverage distributed computation in the cluster during the evaluation step. While there are recent attempts in individual HP tuning algorithms to better utilize resources, they have their own drawbacks and cannot be easily generalized to the whole ecosystem.
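The four-step loop above can be sketched in a few lines of Python. This is a toy illustration only: the function names, the single `lr` hyperparameter, and the synthetic scoring function are all assumptions made for the example, and the "cluster" is just a sequential loop standing in for real training trials.

```python
import math
import random


def generate_configs(history, n, rng):
    """Step 1: propose n configurations (toy rule, not a real tuning algorithm)."""
    if not history:
        # No feedback yet: sample the learning rate log-uniformly in [1e-4, 1].
        return [{"lr": 10 ** rng.uniform(-4.0, 0.0)} for _ in range(n)]
    # Step 4 feeds back here: sample within one decade of the best lr so far.
    best_config, _ = max(history, key=lambda item: item[1])
    center = math.log10(best_config["lr"])
    low, high = max(-4.0, center - 1.0), min(0.0, center + 1.0)
    return [{"lr": 10 ** rng.uniform(low, high)} for _ in range(n)]


def evaluate(config):
    """Steps 2-3: stand-in for a training trial; returns a synthetic score."""
    # Hypothetical objective that peaks at lr = 1e-2.
    return 1.0 - abs(math.log10(config["lr"]) + 2.0) / 4.0


def tune(rounds=3, trials_per_round=4, seed=0):
    rng = random.Random(seed)
    history = []
    for _ in range(rounds):
        configs = generate_configs(history, trials_per_round, rng)
        # A real cluster would run these trials in parallel.
        results = [(config, evaluate(config)) for config in configs]
        history.extend(results)  # results guide the next round of generation
    best = max(history, key=lambda item: item[1])
    return best, history


(best_config, best_score), history = tune()
```

Note that the tuning logic calls `evaluate` directly, so how trials are executed is fixed by the algorithm itself. That coupling is what the rest of this post is about.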
On approaching the research challenge:
The trial execution strategy greatly affects the resource efficiency in hyperparameter tuning solutions. Current strategies can result in suboptimal utilization or can be tricky to parallelize on distributed hardware.
We therefore propose to decouple the execution strategy from hyperparameter tuning algorithms into a separate execution engine. This has the following advantages:
By separating the concerns of tuning algorithms and execution engines, we allow both components to evolve independently.
Resource usage can be optimized, which results in faster tuning for any tuning algorithm, benefiting a wider range of applications.
Our proposed execution engine, Fluid, is algorithm- and resource-aware. It coordinates between the cluster and hyperparameter tuning algorithms. By abstracting hyperparameter tuning jobs as a sequence of TrialGroups, Fluid provides a generic high-level interface for hyperparameter tuning algorithms to express their execution requests for training trials. Fluid then automatically schedules the trials considering both the current workload and available resources to improve utilization and speed up the tuning process.
Fluid models the problem as a strip packing problem in which the rectangles have different shapes and the goal is to minimize the height of the strip. The intuition behind Fluid is to grant more resources to more demanding or promising configurations, such as those with larger training budgets, higher resource requirements, or higher priority. By combining techniques like elastic training and GPU multiplexing, Fluid is able to change the size of a trial, scaling it out across many GPUs or scaling it in within a single GPU. We also propose heuristics to solve the strip packing problem efficiently and prove that their performance is bounded within 2x of the optimal.
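To make the strip packing view concrete, here is a minimal sketch. It assumes each trial is a rectangle whose width is its GPU count and whose height is its running time, with the cluster's GPU count as the strip width. The heuristic shown is First-Fit Decreasing Height, a textbook shelf heuristic chosen only for illustration. It is not the heuristic from the Fluid paper, and it treats trial sizes as fixed, whereas Fluid can also resize trials. The names `TrialRect` and `ffdh_schedule` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TrialRect:
    name: str
    gpus: int        # rectangle width: GPUs the trial occupies
    duration: float  # rectangle height: how long the trial runs


def ffdh_schedule(trials, total_gpus):
    """First-Fit Decreasing Height: pack trials onto shelves of width total_gpus."""
    shelves = []  # each shelf: {"height": ..., "used": ..., "trials": [...]}
    for trial in sorted(trials, key=lambda t: t.duration, reverse=True):
        if trial.gpus > total_gpus:
            raise ValueError(f"{trial.name} needs more GPUs than the cluster has")
        for shelf in shelves:
            # Place the trial on the first shelf with enough free GPUs.
            if shelf["used"] + trial.gpus <= total_gpus:
                shelf["used"] += trial.gpus
                shelf["trials"].append(trial.name)
                break
        else:
            # No shelf fits: open a new one as tall as this trial.
            shelves.append(
                {"height": trial.duration, "used": trial.gpus, "trials": [trial.name]}
            )
    # Shelves run one after another, so the strip height is their sum.
    makespan = sum(shelf["height"] for shelf in shelves)
    return shelves, makespan


example_trials = [
    TrialRect("a", gpus=2, duration=10),
    TrialRect("b", gpus=2, duration=8),
    TrialRect("c", gpus=4, duration=5),
    TrialRect("d", gpus=1, duration=3),
    TrialRect("e", gpus=1, duration=2),
]
shelves, makespan = ffdh_schedule(example_trials, total_gpus=4)
```

For these five made-up trials on a 4-GPU cluster, the sketch produces three shelves (`a` with `b`, then `c`, then `d` with `e`) and a total strip height of 18. The third shelf leaves two GPUs idle, which is the kind of waste that resizing trials could reclaim.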
With Fluid, the tuning algorithm submits TrialGroups according to its evaluation strategy and receives results as soon as they are available. Fluid itself manages job execution and handles resource change events from the cluster.
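The division of labor described above might look roughly like the following sketch. Everything here is hypothetical and is not Fluid's actual API: the `ToyEngine` class, its `submit` method, the `TrialRequest` and `TrialGroup` dataclasses, and the proportional-to-budget allocation rule are all assumptions made to illustrate the idea that the algorithm only describes trials while the engine decides how to run them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TrialRequest:
    config: Dict[str, float]
    budget: int  # e.g. number of training epochs


@dataclass
class TrialGroup:
    trials: List[TrialRequest]


class ToyEngine:
    """Hypothetical stand-in for an execution engine; not Fluid's real interface."""

    def __init__(self, total_gpus: int, train_fn: Callable[[Dict[str, float], int], float]):
        self.total_gpus = total_gpus
        self.train_fn = train_fn

    def allocate(self, group: TrialGroup) -> List[float]:
        # Toy rule: GPU share proportional to training budget.
        # A share below 1.0 stands for several trials sharing one GPU.
        total_budget = sum(t.budget for t in group.trials)
        return [self.total_gpus * t.budget / total_budget for t in group.trials]

    def submit(self, group: TrialGroup, on_result: Callable) -> None:
        # The engine, not the tuning algorithm, decides resources and execution.
        for trial, gpus in zip(group.trials, self.allocate(group)):
            score = self.train_fn(trial.config, trial.budget)
            on_result(trial, gpus, score)  # feedback is delivered per trial


# Usage: the tuning algorithm only builds a group and registers a callback.
engine = ToyEngine(total_gpus=4, train_fn=lambda cfg, budget: budget * cfg["lr"])
group = TrialGroup([TrialRequest({"lr": 0.5}, budget=1), TrialRequest({"lr": 0.25}, budget=3)])
feedback = []
engine.submit(group, lambda trial, gpus, score: feedback.append((gpus, score)))
```

In this sketch the trial with three times the budget receives three of the four GPUs. The point is only that such a policy lives in the engine and can change without touching the tuning algorithm.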
On testbed needs:
GPU clusters of decent size (e.g. at least 4-8 nodes with multiple GPUs on each node to provide some room for the cluster scheduler to act on) are necessary for evaluating both the baseline and our system. Thankfully, Chameleon has various GPU nodes for us to carry out our experiments.
If Chameleon were not available (which has actually happened a few times because of the limited number of GPU nodes), we would have to find other providers of GPU clusters, which isn’t easy and may incur large costs.
On one hand, VM-based solutions like AWS may affect the experiments: we care about performance measurement, and the bare-metal access provided by Chameleon avoids that kind of interference. On the other hand, it isn’t easy to customize the OS and drivers to the exact software environment we want in traditional HPC clusters, which often do not grant users full control over the system.
I’m a 4th year PhD student at the University of Michigan, working with Prof. Mosharaf Chowdhury on interesting problems in systems and networking. I’m especially excited about improving the system infrastructure for deep learning applications. I hope to build software that sees practical use, so I’d prefer to work in industry after graduation.
During quarantine and working from home, I spend most of my free time on open source projects and playing video games.
On staying motivated through a long research project:
Actually I don’t. I think it’s completely normal to have negative thoughts now and then, especially when challenges seem unsolvable. But they will be overcome eventually; it’s just a matter of how. For me, it’s important to have detailed plans beforehand, so that during the low periods I just stick to the plan without thinking, basically trusting my earlier self. By definition there won’t be negative thoughts if you don’t think about it :) Things will eventually work out and I can find the motivation again afterwards.
On choosing his research direction:
I enjoy coding and building tools. So systems research feels like a natural fit for me.
On his most powerful piece of advice:
Try to be organized about your code. While we all write “research quality” code all the time, taking the effort to improve the engineering side of things can help your code have a real impact in the world.
On related artifacts that will be available for interested users to interact with:
Our paper “Fluid: Resource-Aware Hyperparameter Tuning Engine” is published as part of MLSys21. Related materials (slides, poster, videos) can be found on our group website. Fluid itself is open source on our group GitHub and can be used to reproduce the experiment results. We will also update the repo soon with detailed instructions for the experiments.