Accelerate Your Research with NVIDIA H100 GPUs on KVM@TACC
Tips and tricks for making the most of Chameleon's new GPU resources and reservation-based workflow
June 20, 2025 by Cody Hammock
The wait is over: NVIDIA H100 GPUs are now available on KVM@TACC! Whether you're training large language models, running complex simulations, or pushing the boundaries of scientific computing, our new reservation-based H100 VMs give you access to cutting-edge GPU acceleration in a flexible virtual environment. More flexibility with GPUs on KVM is coming soon, including multi-GPU support and fractional GPU access for smaller jobs.
Due to high demand and growing interest in these powerful accelerators across the testbed, we've focused on making our new H100s accessible to as many Chameleon users as possible. These nodes pack a lot of heat. Each node is an Intel Xeon Platinum-based Dell PowerEdge XE9640 with 4 NVIDIA HGX H100 GPUs, 1 TB of DDR5-4800 RAM, two 447.13 GB PCIe NVMe drives, and one 3.84 TB PCIe NVMe drive. The nodes are connected via a single 25 GbE link. These GPUs excel at:
- Large language model training and inference
- High-resolution scientific simulations
- Complex deep learning workloads
- Accelerated data analytics and visualization
To maximize the impact of these powerful machines, we've prioritized (for now) a virtual approach over bare metal: virtualization lets multiple users share these resources simultaneously, which broadens access for GPU-hungry users. As with bare metal, users must reserve a flavor before launching a GPU-enabled VM, and leases last up to one week. Time-limited reservations further improve availability by timeboxing usage, so more users get access over a given period. Non-GPU-enabled VMs remain on-demand for the time being, with no reservation needed. Over the summer, however, we'll be extending the reservation model to the rest of KVM@TACC. This transition will help ensure fair access to all KVM resources as demand continues to grow.
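For work that needs more than one week of GPU time, it can help to plan the lease windows up front. The following is a minimal Python sketch, not part of the Chameleon tooling: the function names are hypothetical, the timestamp layout is an assumption, and back-to-back windows are only a planning aid, since actual availability depends on what other users have reserved.

```python
from datetime import datetime, timedelta

MAX_LEASE = timedelta(days=7)  # one-week cap described in this post


def plan_leases(start, total, max_lease=MAX_LEASE):
    """Split a total runtime into consecutive windows no longer than max_lease."""
    if total <= timedelta(0):
        raise ValueError("total must be positive")
    windows = []
    cursor = start
    remaining = total
    while remaining > timedelta(0):
        length = min(remaining, max_lease)
        windows.append((cursor, cursor + length))
        cursor += length
        remaining -= length
    return windows


def fmt(ts):
    # Assumed timestamp layout; check the reservation docs for the exact format.
    return ts.strftime("%Y-%m-%d %H:%M")


if __name__ == "__main__":
    # Hypothetical 16-day job starting July 1, 2025 at 09:00.
    for begin, end in plan_leases(datetime(2025, 7, 1, 9, 0), timedelta(days=16)):
        print(fmt(begin), "->", fmt(end))
```

Each window boundary is a natural point to checkpoint, snapshot, and rebuild, so the plan doubles as a schedule for saving state.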
While time-limited VMs represent a shift in workflow, KVM@TACC offers a number of features that make it well-suited for many common workflows (including CPU and GPU workloads) on Chameleon. Below, we dive into some of the tips and tricks for utilizing KVM and the new GPUs effectively.
Reproducibility and Infrastructure as Code
KVM virtual machines can be provisioned in minutes, making them ideal for rapid experimentation. Whether you're running benchmarks, testing new configurations, or deploying a training cluster, you can get started quickly without waiting for physical hardware.
By using automation tools like OpenStack Heat, Terraform, or Ansible, you can:
- Rebuild environments from scratch between leases
- Share templates with collaborators or students
- Ensure results are repeatable across runs
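As one illustration of rebuilding an environment between leases, the sketch below generates a minimal OpenStack Heat template from Python. It is an assumption-laden example rather than an official Chameleon template: the image, key pair, and network names are placeholders, and the flavor is exposed as a template parameter on the assumption that the reserved flavor differs from lease to lease. JSON is used only to stay within the standard library; in practice such templates are usually written as YAML.

```python
import json


def build_template(image, key_name, network):
    """Return a minimal Heat Orchestration Template (HOT) as a plain dict."""
    return {
        "heat_template_version": "2018-08-31",
        "description": "Single GPU VM that can be rebuilt at the start of each lease",
        "parameters": {
            # Assumption: the reserved flavor changes per lease, so supply it at launch.
            "flavor": {
                "type": "string",
                "description": "Flavor provided by the current reservation",
            },
        },
        "resources": {
            "gpu_vm": {
                "type": "OS::Nova::Server",
                "properties": {
                    "image": image,
                    "flavor": {"get_param": "flavor"},
                    "key_name": key_name,
                    "networks": [{"network": network}],
                },
            },
        },
    }


if __name__ == "__main__":
    # All three names below are placeholders; substitute your own.
    template = build_template("my-gpu-image", "my-keypair", "my-network")
    print(json.dumps(template, indent=2))
```

Keeping the template in version control alongside the experiment code gives collaborators or students a single file to launch from.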
Even if your final workload runs on bare metal, KVM is an efficient prototyping platform for developing and refining your setup first.
Persistent Storage and Snapshots
Even if your VM lasts only a week, your data doesn't have to. OpenStack Cinder volumes let you store data independently of your virtual machines. Volumes:
- Persist across leases
- Can be attached and detached as needed
- Can be resized or snapshotted for backup and rollback
Snapshots of both VMs and volumes make it easy to pause and resume work across leases. If you need to shut down early or something goes wrong, you can pick up where you left off during your next reservation. This is especially valuable for teaching, debugging, and long-running research pipelines where reliability and repeatability are essential.
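On the application side, picking up where you left off usually means writing progress to a directory on the attached volume. Here is a minimal sketch using only the Python standard library; the function names and checkpoint file name are hypothetical, and the demo uses a temporary directory as a stand-in for wherever your volume is mounted.

```python
import json
import os
import tempfile

CHECKPOINT_NAME = "checkpoint.json"  # hypothetical file name


def save_checkpoint(directory, state):
    """Write state as JSON via a temp file and rename, so a crash mid-write keeps the old copy."""
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    final_path = os.path.join(directory, CHECKPOINT_NAME)
    os.replace(tmp_path, final_path)  # atomic when both paths are on the same filesystem
    return final_path


def load_checkpoint(directory, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(os.path.join(directory, CHECKPOINT_NAME)) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return default


if __name__ == "__main__":
    # A temporary directory stands in for the volume's mount point.
    with tempfile.TemporaryDirectory() as volume_dir:
        state = load_checkpoint(volume_dir, {"step": 0})
        state["step"] += 100  # pretend some work happened
        save_checkpoint(volume_dir, state)
        print(load_checkpoint(volume_dir, None))
```

A job that calls `load_checkpoint` at startup behaves the same on a fresh VM and on one rebuilt in a later lease, as long as the same volume is attached.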
Flexible Networking
Another key resource that persists between reservations is your user-defined networking.
With OpenStack Neutron, you can:
- Build custom virtual topologies
- Use floating IPs and load balancers
- Define security groups and firewall rules
Whether you're experimenting with cloud-native architectures, building testbeds for microservices, or teaching networking concepts, KVM's networking capabilities enable realistic, reusable environments—even within a short lease.
Ideal for Education
KVM@TACC is well-suited to classroom use. Instructors can provide students with isolated virtual machines, reproducible lab environments, and the freedom to explore and recover from mistakes. A one-week lease is typically more than sufficient for assignments or short-term projects, and snapshots or automated rebuilds make it easy to start over if needed.
This setup supports instruction across a range of subjects—from networking and systems to security and DevOps—while giving students hands-on experience with infrastructure tools they'll encounter in the field.
Getting Started with GPU-Enabled VMs
Ready to launch your first GPU-enabled virtual machine? We've made it easy to get started:
- Follow our hands-on tutorial: Visit our Trovi sharing portal for an interactive walkthrough that covers:
  - Making your first GPU reservation
  - Launching a VM with GPU support
  - Verifying GPU access and running sample workloads
  - Best practices for managing your one-week lease
- Dive into the documentation: For detailed technical information about the reservation system, check out the official Chameleon documentation on reservations.
- Plan your workflow: Remember that leases last up to one week, so familiarize yourself with our persistence features (covered above) to ensure your work continues smoothly between reservations.
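Once a VM is up, a quick check that the GPU is visible might look like the sketch below. It is not taken from the tutorial: it assumes the NVIDIA driver and the `nvidia-smi` utility are installed in the image, and the helper names are hypothetical. It returns an empty list rather than failing when no GPU tooling is present.

```python
import shutil
import subprocess

# Query flags supported by nvidia-smi; the output has one CSV line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=index,name,memory.total", "--format=csv,noheader"]


def parse_gpu_list(text):
    """Turn 'index, name, memory' CSV lines into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        index, name, memory = [field.strip() for field in line.split(",", 2)]
        gpus.append({"index": int(index), "name": name, "memory": memory})
    return gpus


def list_gpus():
    """Return the GPUs visible to this VM, or [] if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except subprocess.CalledProcessError:
        return []
    return parse_gpu_list(result.stdout)


if __name__ == "__main__":
    gpus = list_gpus()
    if gpus:
        for gpu in gpus:
            print(gpu["index"], gpu["name"], gpu["memory"])
    else:
        print("No GPU visible; check the reservation, flavor, and driver installation.")
```

Running a check like this at the top of a job script makes a misconfigured launch fail fast instead of silently falling back to the CPU.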
In short, the reservation-based model introduces a limited VM lifetime, but KVM@TACC continues to offer a fast, flexible, and resilient environment for research and education. Whether you're launching a one-hour benchmark or a week-long simulation, KVM provides the tools to build quickly, iterate confidently, and preserve what matters most.