Composable Hardware on Chameleon NOW!

Introducing new GigaIO nodes with A100 GPUs

Composable hardware in Chameleon!

We're excited to announce the availability of GigaIO's composable hardware at both CHI@UC and CHI@TACC.

These systems consist of a set of bare metal nodes, accelerators, and a configurable PCIe backplane to connect them. GigaIO's own overview can be found here.

Composable hardware enables more flexibility in how a node is configured. For example, at UC the system currently consists of 8 nodes and 8 A100 GPUs. By default, each node is connected to 1 GPU, but it is possible to connect up to all 8 GPUs to a single node, or any combination in between. This means that we can now support configurations that would have been too expensive before (scaling with 8 GPUs), as well as improving utilization of the available hardware (an experiment with 2 GPUs and one with 6 GPUs can be run at the same time).

What you can do with it!

The backplane is technically a fabric of interconnected PCIe Switches but can be thought of as a larger "Virtual Switch", allowing any of the accelerators to be attached to any node, where it will appear to the OS as if directly connected.

Due to the complexity of reconfiguring not just "GPU 1 to Node 4", but provisioning a path through the switch fabric, we do not have an API for reconfiguration available directly. Instead, each node's "default" configuration is available in the reference-api and for reservation. If you would like to reserve 4 GPUs and 1 node, you'll instead need to reserve all 4 nodes, since they each "come with" one GPU. After making your reservation, please submit a helpdesk ticket, and we will trigger the re-composition of the resources when your reservation starts.

The diagram below shows some example configurations and the corresponding reservations.

Note: The 2x 25G Ethernet connections are not shown.

As compared to the existing LIQID deployment at TACC, the GigaIO systems support host-host and device-device communication, not just host-device. (Host-host is in beta and requires specific kernel support.)

Hardware in CHI@UC:

Servers

8 Dell R6525 servers, with node_type "compute_gigaio"

  • 2x AMD EPYC 7763 64-Core Processors, for a total of 256 threads
  • 512 GB of RAM
  • 2x 480GB SATA SSDs
  • 2x 25G NICs
  • 8x lanes of PCIe 4.0 to the backplane: (128 Gb/s) Half Duplex, (256 Gb/s) Full Duplex

Accelerators

8 Nvidia A100 PCIe GPUs, with:

  • 80 GB RAM
  • 8x lanes of PCIe 4.0 to the backplane: (128 Gb/s) Half Duplex, (256 Gb/s) Full Duplex

Connectivity

Hardware in CHI@TACC:

GigaIO:

6 PowerEdge R650 nodes, of node type "compute_gigaio"

  • Intel Xeon Platinum 8380
  • 256 GB RAM
  • 1x 40 GbE NIC
  • 1x 480 GB SATA SSD
  • 8x lanes of PCIe 4.0 to the backplane: (128 Gb/s) Half Duplex, (256 Gb/s) Full Duplex

4 Nvidia A100-class PCIe GPUs

  • 2x GA100 [A100 PCIe 40GB]
  • 2x GA100GL [A30 PCIe]

LIQID:

8 PowerEdge R6525 nodes, of node type "compute_liqid"

  • AMD EPYC 7763
  • 256 GB RAM
  • 1x 480 GB SATA SSD
  • 16x lanes of PCIe 4.0 to the backplane

10 x GA100 [A100 PCIe 40GB]
16 x Samsung NVMe "Honeybadger" 3839 GB

Chameleon Testbed Secures $12 Million in Funding for Phase 4

Expanding Frontiers in Computer Science Research

We are thrilled to announce that Chameleon, our experimental testbed for computer science research, has been awarded $12 million in funding from the U.S. National Science Foundation (NSF) for its fourth phase. Four more years of Chameleon!

Chameleon Changelog for August 2022

We hope everybody has had a great vacation –  welcome back, we are glad to see you again! In this month’s changelog, we bring you groundbreaking composable Liqid hardware at CHI@TACC, updated categorization for projects, appliance news, and Xena upgrade at CHI@NU. Additionally, we have a reminder for a scheduled outage of our authentication service, and important notes on using our A100 hardware.
 


Add a comment

No comments