Chameleon Changelog for August 2023
- Sept. 1, 2023
Dear Chameleon users,
This month, we bring more exciting updates to Chameleon.
Introducing the FOUNT Project! Last month, we announced the REPETO project to foster practical reproducibility on open platforms. This month, we are excited to announce the FOUNT project, which aims to develop scaffolded educational modules on open clouds and testbeds. These courselets are developed with literate programming, enabling students to explore their contents hands-on. Right now, there is one educational artifact published on Trovi, which showcases a module that uses self-driving cars to teach machine learning concepts. If you are an educator and have developed educational modules, you can submit them through this form. To stay up to date with the project, you can join the mailing list and follow it on Twitter or LinkedIn. Please feel free to join in the discussion.
Ice Lake nodes at CHI@TACC. Last month we announced 20 new nodes on CHI@TACC. We’ve added even more this month, with 52 new Ice Lake nodes. These nodes are Dell R750 servers, each with 2x Intel Xeon Platinum 8380 CPUs, for a total of 80 cores and 160 threads per node, as well as 256GB of RAM. They’re configured with 2x 25G Ethernet interfaces and a 480GB SATA SSD for the boot drive. Additionally, the compute_icelake_r650 nodes have Mellanox ConnectX-6 cards for their network interfaces, which advanced users can use as SmartNICs. You can find more information about them in the hardware browser.
CHI-in-a-Box hardware management improvements. CHI-in-a-Box packages the Chameleon infrastructure into a portable tool that powers associate sites like CHI@NCAR and CHI@EVL. This month, we updated the hardware management tool, Doni. Now, you can use names instead of UUIDs when working with resources, and see worker status and last errors in the list view. This should improve the operator experience, especially when enrolling large numbers of nodes.
Improved multi-node launches. If you have ever tried to launch multiple instances at once on our bare metal sites, you may have seen a timeout error. This was caused by slow reconfiguration in the switch vendor’s OS, and we’ve worked around it by batching multiple requests together. We’ve tested launches of 30+ nodes, so you should be able to launch as many instances at once as you like, provided you have an active lease for the nodes.
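To illustrate why batching helps, here is a minimal Python sketch of the general idea. It is not Chameleon's actual implementation: the function name `apply_in_batches`, the port names, and the batch size of 10 are all hypothetical, and the "switch" is just a list that records each reconfiguration it receives.

```python
from typing import Callable, List, Sequence


def apply_in_batches(
    requests: Sequence[str],
    apply_batch: Callable[[List[str]], None],
    batch_size: int,
) -> int:
    """Send requests to apply_batch in groups of at most batch_size.

    Returns the number of batch calls made, i.e. how many times the
    (slow) reconfiguration step was triggered.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    calls = 0
    for start in range(0, len(requests), batch_size):
        apply_batch(list(requests[start:start + batch_size]))
        calls += 1
    return calls


# Hypothetical stand-in for a slow switch: it only records each batch.
reconfigurations: List[List[str]] = []

# Thirty hypothetical port-configuration requests, one per launched node.
port_requests = [f"port-{i}" for i in range(30)]

calls = apply_in_batches(port_requests, reconfigurations.append, batch_size=10)
print(calls)  # 3 reconfigurations instead of 30
```

If each reconfiguration is slow, grouping 30 requests into 3 batches means paying that cost 3 times rather than 30, which is what keeps a large launch from running into a timeout.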
Updated A100 image. Users reported issues with networking on the A100 NVLink node at UC. This was caused by a bug in Intel’s ICE driver. We’ve updated CC-Ubuntu20.04-CUDA and CC-Ubuntu22.04-CUDA with newer kernels to fix this issue.
Happy experimenting!