Chameleon Changelog for May 2025
June 2, 2025 by Mark Powers
Dear Chameleon users,
It’s a big month on the testbed, so best jump right into the new features.
New hardware: GPUs on KVM@TACC! We are thrilled to announce preview availability of new hardware: two new nodes, each a Dell PowerEdge XE9640 equipped with an Intel Xeon Platinum processor, 4 NVIDIA HGX H100 GPUs, 1 TB of DDR5-4800 RAM, 2x 447.13 GB PCIe NVMe drives, and 1x 3.84 TB PCIe NVMe drive. The nodes are connected with 1x 25 GbE, and four more nodes are coming soon! In our hurry to give you access to these new nodes, we haven’t yet updated our hardware browser, so the information above is all you get for now. Of course, let us know anytime if you need more detail!
The big innovation is that these nodes are available via KVM, which we have significantly extended to support advance reservations and fractional GPU leases. The KVM@TACC you know and love runs on older Haswell series compute nodes, and its flavors let you launch basic compute VMs. Now, if you use a special new flavor, you’ll be able to launch a VM with an H100 GPU slice. At the moment, each slice is one quarter of a node (meaning 1 full H100 GPU per instance), but soon we will make available instances tied to a fraction of a GPU, which means that we will be able to map many instances to one node. This division will make our GPUs more accessible than with bare metal, where the entire node can only be used by one lease at a time.
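To make the sharing arithmetic concrete, here is a minimal Python sketch. The node and GPU counts come from the announcement above; the function name and the `slices_per_gpu` parameter are hypothetical and only stand in for the future fractional GPU slices, whose actual sizes have not been announced.

```python
# Illustrative arithmetic only; not part of any Chameleon tool or API.
NODES = 2            # preview nodes available today (from the announcement)
GPUS_PER_NODE = 4    # H100 GPUs per node (from the announcement)


def max_concurrent_instances(nodes: int, gpus_per_node: int,
                             slices_per_gpu: int = 1) -> int:
    """Upper bound on GPU instances that can run at the same time.

    slices_per_gpu=1 models the current preview (one full GPU per instance).
    Larger values are a hypothetical stand-in for future fractional slices.
    """
    return nodes * gpus_per_node * slices_per_gpu


# With bare metal, one lease holds a whole node, so at most NODES users at once.
bare_metal_users = NODES
kvm_users = max_concurrent_instances(NODES, GPUS_PER_NODE)

print(f"bare metal: up to {bare_metal_users} concurrent leases")
print(f"KVM GPU flavor: up to {kvm_users} concurrent instances")
```

Under these assumptions, the same two nodes can serve up to eight single-GPU instances at once instead of two whole-node leases.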
To ensure that these GPUs can be efficiently shared between users, you’ll have to create an advance reservation before you get access to this special GPU flavor, similar to the workflow for using our bare metal nodes. However, unlike bare metal reservations, you’ll reserve a flavor instead of a specific node. Once your lease is active, you can launch an instance as normal, but when selecting the flavor, you’ll see a new option based on the reservation. There is also an availability calendar which lets you see the availability of these new nodes over time. For more information about this process, see our KVM documentation. For now, the use of reservations is only possible on this new GPU hardware; the existing workflows on KVM@TACC still work as they always have (but read below). Since the GPU nodes and flavor reservations on KVM@TACC are very new (and, as we explained, we have not yet entered them into the resource discovery service), this release should be considered a preview. Please let us know if you have any questions or problems via the Help Desk. Enjoy!
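The difference between reserving a flavor and reserving a specific node can be sketched as a capacity check against a shared pool. The Python below is only an illustration of that idea, not how the reservation service is implemented; `peak_usage`, `can_reserve`, and the lease tuples are hypothetical names, and the capacity of 8 assumes the two preview nodes with four GPU slices each.

```python
from datetime import datetime, timedelta


def peak_usage(leases, start, end):
    """Largest number of slices reserved at any moment in [start, end).

    Each lease is a (lease_start, lease_end, amount) tuple.
    """
    events = []
    for lease_start, lease_end, amount in leases:
        # Clip each lease to the window we are asking about.
        s, e = max(lease_start, start), min(lease_end, end)
        if s < e:
            events.append((s, amount))
            events.append((e, -amount))
    # At the same instant, process releases (negative) before acquisitions.
    events.sort(key=lambda ev: (ev[0], ev[1]))
    peak = current = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


def can_reserve(leases, start, end, amount, capacity):
    """True if `amount` more slices fit in the pool for the whole window."""
    return peak_usage(leases, start, end) + amount <= capacity


day = datetime(2025, 6, 10)


def at(hour):
    return day + timedelta(hours=hour)


existing = [(at(0), at(12), 6), (at(6), at(18), 2)]
print(can_reserve(existing, at(6), at(12), 1, capacity=8))   # False
print(can_reserve(existing, at(12), at(18), 6, capacity=8))  # True
```

Because the request is checked against the pool rather than a named node, any free slice satisfies it, which is also the kind of information an availability calendar summarizes over time.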
Upcoming KVM@TACC reservations. As noted above, for now the older parts of KVM@TACC still work as before, meaning you can launch compute instances using the old flavors. This is going to change later this summer, when we will enable advance reservations for all KVM instances. We announced these plans back in February, but postponed the work so as not to introduce disruptive changes during the Spring semester. We are making this change for a few reasons. Firstly, we want to ensure consistent workflows between Chameleon sites. KVM@TACC has been the only site that doesn’t require reservations, while bare metal sites and CHI@Edge do. In addition, “forgotten” KVM instances pose security risks if they are left running without security patches. You will still be able to have long-running instances under these changes, but they will be tied to a lease, so you must renew it periodically and stay an active user. We’ll have more details in next month’s changelog about what exactly these changes will mean for you, but we expect that they may require you to relaunch your instances. Stay tuned for more information.
Better docs! Last year, we revamped our Getting Started documentation, and this month we’ve refreshed the rest of the documentation too! Most of the content is the same, but we’ve cleaned up references to old and outdated systems, rearranged sections for clarity, and fixed typos and other issues. Additionally, we’ve added feedback buttons, allowing you to rate the documentation pages. This information will help us guide future documentation work, ensuring we give attention to pages that are confusing users. We’ve also added a button to the bottom of each page which creates a GitHub issue, letting you comment on any specific problem that comes up. If you have any feedback or issues with Chameleon, you are welcome as always to contact us via our Help Desk or the Chameleon Forum.
CHI-in-a-box updates for image deployment. CHI-in-a-box is the software that packages the Chameleon infrastructure, which in turn makes Chameleon’s associate sites like CHI@EVL and CHI@NU possible. Last month, we released brand new OS images for ARM and AMD ROCm, but since Chameleon sites are federated, we can’t simply “push” these new images to all sites. To solve this problem, we’ve released an image-deployer tool as a part of CHI-in-a-box that associate site operators can run to fetch updated images from our flagship sites. The tool can be used to pull in image updates manually, or to manage image versions automatically in the background. For more information on how to configure and run this tool, site operators can see the CHI-in-a-box documentation.
Happy experimenting!